Google GLaM Vs DeepMind Gopher: Who Wins The Large Language Model Race

With Gopher and GLaM introduced back to back, let’s see how they fare against each other

2021 has been a transformational year for large language models, and the pace is only picking up. A day after DeepMind came out with Gopher, a 280-billion-parameter transformer language model, Google introduced the Generalist Language Model (GLaM), a trillion-weight model that uses sparsity. The full version of GLaM has 1.2T total parameters across 32 mixture-of-experts (MoE) layers with 64 experts per layer, but activates only a subnetwork of 97B parameters (8% of 1.2T) per token prediction during inference.

After OpenAI released the path-breaking GPT-3 autoregressive language model with 175 billion machine learning parameters in 2020, tech giants stepped up and released their own large language models to keep up with the competition. AI21 Labs released Jurassic-1, which has 178 billion parameters. Microsoft and NVIDIA went a step further and introduced the Megatron-Turing Natural Language Generation (MT-NLG) model with an astounding 530 billion parameters. Earlier this year, Google released Switch Transformers, a technique for training language models with over a trillion parameters.



Now with Gopher and GLaM introduced, back-to-back comparisons between the two are bound to happen. Let us look at how they fare against each other.

GLaM serves efficiently in terms of computation and energy use

Training and serving large language models can be computationally intensive. Google says that, thanks to sparsity, GLaM can be trained and served efficiently in terms of computation and energy use, while achieving competitive performance on multiple few-shot learning tasks. GLaM is a mixture-of-experts (MoE) model with different submodels (or experts) specialised for different inputs.

Trained on MassiveText

Gopher, on the other hand, is trained on MassiveText, a collection of large English-language text datasets drawn from web pages, books, news articles, and code. DeepMind says the pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap, and found that successive stages of this pipeline improve the language model's downstream performance. MassiveText contains 2.35 billion documents, or about 10.5 TB of text.

Difference in their dataset and architecture


Feedforward network replaced with MoE layer

As per Google, for GLaM, they built a high-quality 1.6 trillion token dataset with language usage representative of a diverse range of downstream use cases for the model. Web pages contain a lot of unlabelled data whose quality ranges from low-quality comments to professional writing.

Google AI developed a text quality filter, trained on a collection of text from Wikipedia and books, that scores the quality of a webpage's content. They applied this filter to select the final subset of web pages, which was combined with books and Wikipedia to create the final training dataset.

Google says that they replaced the single feedforward network with an MoE layer. Even though this MoE layer has many more parameters, the experts are sparsely activated. For a given input token, only two experts are used, giving the model more capacity while limiting computation.

The final learned representation of a token will be the weighted combination of the outputs from the two experts. This allows different experts to activate different types of inputs. 
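The routing described above can be sketched in a few lines of NumPy. This is a toy illustration only, with made-up sizes and randomly initialised weights rather than Google's implementation; names such as moe_layer and W_gate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_FF, N_EXPERTS, TOP_K = 16, 32, 64, 2  # toy sizes, not GLaM's

# Hypothetical parameters: a gating matrix, plus one tiny feedforward "expert" each.
W_gate = rng.normal(size=(D_MODEL, N_EXPERTS))
experts = [(rng.normal(size=(D_MODEL, D_FF)), rng.normal(size=(D_FF, D_MODEL)))
           for _ in range(N_EXPERTS)]

def moe_layer(x):
    """Route one token vector x to its top-2 experts and mix their outputs."""
    logits = x @ W_gate
    top2 = np.argsort(logits)[-TOP_K:]       # indices of the 2 highest-scoring experts
    weights = np.exp(logits[top2])
    weights /= weights.sum()                 # softmax over the chosen two only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top2):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)  # ReLU feedforward expert
    return out, top2

token = rng.normal(size=D_MODEL)
y, chosen = moe_layer(token)
print(chosen)  # only the 2 selected experts are ever evaluated for this token
```

The key point is in the loop: although all 64 experts hold parameters, only two are computed per token, which is how the model gains capacity without a matching increase in inference cost.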

Image: Google (The architecture of GLaM where each input token is dynamically routed to two selected expert networks out of 64 for prediction)


Content filtering, text extraction, quality filtering, and repetition removal

In the MassiveText subsets, the non-English documents are filtered out, and the data is processed into a homogeneous text-only format. The documents are deduplicated, and the research team filtered out documents too similar to those in their test sets.
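The simplest form of the deduplication step can be illustrated with a content hash. This is a hedged sketch, not DeepMind's pipeline: it only catches exact duplicates after light normalisation, whereas the paper also removes near-duplicates, which requires fuzzier matching (e.g. MinHash).

```python
import hashlib

def dedup_exact(docs):
    """Drop byte-identical documents after whitespace/case normalisation.

    A toy stand-in for the deduplication stage described above.
    """
    seen, kept = set(), []
    for doc in docs:
        normalised = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalised.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello   world", "hello world", "Something else"]
print(dedup_exact(docs))  # the two whitespace/case variants collapse to one
```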

For MassiveWeb, the web-text subset they curated, the team obtained data in a text-only format using a custom HTML scraper and applied an extra filter at the initial stages to remove explicit content. They then applied simple heuristics to filter out low-quality text.

Image: Data processing stages (Scaling Language Models: Methods, Analysis & Insights from Training Gopher)

  • Content Filtering – The non-English documents are filtered. Pages from MassiveWeb that do not pass Google’s SafeSearch filter are also removed. 
  • Text Extraction (MassiveWeb) – The text from web pages are extracted using the tree structure of the HTML markup. DeepMind said, “We extract text from web pages using the tree structure of the HTML markup. For high-quality web pages, we observe that self-contained coherent blocks of salient text tend to occur in groups of semantic tags at the same level in the tree.” These tags are converted to plain text. This gives a huge volume of text documents.
  • Quality Filtering (MassiveWeb) – A huge chunk of the web is social media content, which can lack context and be of low quality. Using filters, DeepMind removes any document that does not contain between 50 and 100,000 words, whose mean word length falls outside the range of 3 to 10 characters, or whose symbol-to-word ratio is greater than 0.1 for either the hash symbol or the ellipsis. They also remove any document with more than 90% of lines starting with a bullet point or more than 30% ending with an ellipsis, and require that 80% of words in a document contain at least one alphabetic character.

  • Repetition Removal (MassiveWeb) – They remove documents with a high proportion of repeated lines, paragraphs, or n-grams. They also remove documents containing many short duplicate passages and ones with fewer, larger sections of duplicate content.
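The quality-filtering thresholds listed above are concrete enough to sketch in code. This is a simplified illustration, assuming whitespace tokenisation and plain-text input; the thresholds come from the Gopher paper, everything else is a toy approximation of the real pipeline.

```python
def passes_quality_filter(doc: str) -> bool:
    """Toy version of the MassiveWeb quality heuristics described above."""
    words = doc.split()
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    if not words or not lines:
        return False

    # 50 to 100,000 words, mean word length between 3 and 10 characters.
    if not 50 <= len(words) <= 100_000:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:
        return False

    # Symbol-to-word ratio of at most 0.1 for '#' and for ellipses.
    for symbol in ("#", "..."):
        if doc.count(symbol) / len(words) > 0.1:
            return False

    # At most 90% of lines starting with a bullet, 30% ending with an ellipsis.
    if sum(ln.lstrip().startswith("•") for ln in lines) / len(lines) > 0.9:
        return False
    if sum(ln.rstrip().endswith("...") for ln in lines) / len(lines) > 0.3:
        return False

    # At least 80% of words must contain an alphabetic character.
    alpha = sum(any(c.isalpha() for c in w) for w in words)
    return alpha / len(words) >= 0.8

good = " ".join(["sentence"] * 60)  # 60 ordinary words, mean length 8
print(passes_quality_filter(good))  # True under these toy rules
```

A one-word comment fails the word-count check, while a document of pure symbols fails both the symbol-ratio and alphabetic-character checks, matching the intent of filtering out low-quality social media fragments.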

Gopher, GLaM Vs Other Large Language Models


DeepMind's research goes on to say that Gopher almost halves the accuracy gap from GPT-3 to human expert performance and exceeds forecaster expectations. It states that Gopher lifts performance over current state-of-the-art language models across roughly 81% of tasks with comparable results, most notably in knowledge-intensive domains like fact-checking and general knowledge.

DeepMind's Gopher does not outperform the state of the art on 8 of 19 tasks, underperforming on Ubuntu IRC and DM Mathematics in particular, possibly due to a poor tokeniser representation for numbers. Gopher demonstrates improved modelling on 11 of 19 tasks, in particular books and articles, likely owing to the heavy use of book data in MassiveText (a sampling proportion of 27%, compared to 16% for GPT-3).

In the research paper, DeepMind draws a comparison between Gopher and existing models, saying that Gopher outperforms the current state of the art on 100 tasks (81% of all tasks). The baselines include large language models such as GPT-3 (175 billion parameters), Jurassic-1 (178 billion parameters), and Megatron-Turing NLG (530 billion parameters). They found that Gopher showed the most uniform improvement across the reading comprehension, humanities, ethics, STEM, and medicine categories, along with a general improvement in fact-checking.

Image: DeepMind’s Scaling Language Models: Methods, Analysis & Insights from Training Gopher


GLaM’s performance compares favourably to GPT-3 (175B), with significantly improved learning efficiency across 29 public NLP benchmarks in seven categories. This covers language completion, open-domain question answering, and natural language inference tasks.

Compared with the Megatron-Turing model, GLaM is on par on the seven respective tasks within a 5% margin, while using 5x less computation during inference.

Image: Google (Average score for GLaM and GPT-3 on NLG (left) and NLU (right) tasks; higher is better)


With the two models from two big names coming out back to back, seeing how they perform when deployed across sectors will be interesting. Competitors are surely taking notice, and we can expect more such large language models to get released in the near future.


Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good.
