Google Introduces New Architecture To Reduce Cost Of Transformers

Primer’s improvements can be attributed to two simple modifications -- squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention.
Over the last few years, transformers such as BERT, XLNet, RoBERTa and GPT-3 have driven many advances in natural language processing (NLP). While these models perform significantly better with scale, training large models has become increasingly expensive.

For instance, training an XLNet (a BERT alternative) model is estimated to cost about $245,000. This estimate is based on the resource breakdown provided in the paper, where the researchers trained XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimiser, linear learning rate decay, and a batch size of 2048, which took about 2.5 days. GPT-3, on the other hand, has 175 billion parameters and a memory footprint that exceeds 350 GB; training it is estimated to cost close to $12 million, about 200x the price of GPT-2.
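The figures above can be sanity-checked with some quick arithmetic. The sketch below only rearranges the numbers quoted in this article; the per-chip-hour rate and the GPT-2 cost are values implied by those numbers, not published prices, and the 2 bytes per parameter (16-bit weights) is an assumption made here for illustration.

```python
# Figures quoted in the article.
xlnet_cost_usd = 245_000
tpu_chips = 512
training_days = 2.5

# Total accelerator time and the hourly rate the estimate implies.
chip_hours = tpu_chips * training_days * 24
implied_rate_usd = xlnet_cost_usd / chip_hours

# Assumption (not from the article): each parameter is stored in 2 bytes.
gpt3_params = 175e9
bytes_per_param = 2
weights_gb = gpt3_params * bytes_per_param / 1e9

# "200x the price of GPT-2" implies this GPT-2 training cost.
gpt3_cost_usd = 12_000_000
implied_gpt2_cost_usd = gpt3_cost_usd / 200

print(f"XLNet-Large: {chip_hours:,.0f} chip-hours, "
      f"about ${implied_rate_usd:.2f} per chip-hour")
print(f"GPT-3 weights at 2 bytes each: {weights_gb:.0f} GB")
print(f"Implied GPT-2 training cost: ${implied_gpt2_cost_usd:,.0f}")
```

Under that 2-byte assumption the weights alone come to 350 GB, which is consistent with the memory figure quoted above.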

To give a ballpark estimate, the researchers behind the paper ‘The cost of training NLP models’ noted the following cost ranges:

  • 110 million parameters: $2.5K to $50K
  • 340 million parameters: $10K to $200K
  • 1.5 billion parameters: $80K to $1.6 million

Introducing Primer 

To reduce the training cost of transformer language models, Google proposed searching for a more efficient alternative to the transformer by modifying its TensorFlow computation graph. As a result, the team identified Primer (PRIMitives searched transformER), an architecture with a smaller training cost than the original transformer and other variants for auto-regressive language modelling.

Further, the experiments show that Primer has the following benefits:

  • Achieving a target quality using a smaller training cost
  • Achieving higher quality given a fixed training cost 
  • Achieving a target quality using a smaller inference cost 

These benefits are robust and hold across model sizes (20 million to 1.9 billion parameters), compute scales (10 to 10^5 accelerator hours), datasets (LM1B, C4, PG19), hardware platforms (TPUv2, TPUv3, TPUv4 and V100), multiple transformer codebases using default configurations (Tensor2Tensor, T5 and Lingvo), and multiple model families (sparse mixture-of-experts Switch Transformers, dense transformers, and Synthesizers).

The researchers also found that the compute savings of Primer over transformers increase as training cost grows when controlling for model size and quality. When using optimally sized models, these savings follow a power law with respect to quality.

The source code for Primer is available on GitHub.

Here’s how it reduces the cost 

To demonstrate Primer’s savings in an established training setup, the researchers compared a 500 million parameter Primer to the original T5 architecture, using the exact configuration used by Raffel et al. applied to auto-regressive language modelling. In this setup, Primer achieved an improvement of 0.9 perplexity given the same training cost and reached quality parity with the T5 baseline model using 4.2x less compute.

Further, the researchers demonstrated that Primer’s savings transfer to one-shot evaluations by comparing Primer to the transformer at 1.9 billion parameters in a setup similar to GPT-3 XL. Here, Primer achieved similar performance to the transformer on both pretraining perplexity and downstream one-shot tasks, using 3x less training compute. 
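To put the two reported ratios in more familiar terms, "N times less compute" can be converted into a share of the baseline compute saved. The snippet below is plain arithmetic on the 4.2x and 3x figures quoted above; the function name `compute_saved` is a hypothetical helper, not something from the paper.

```python
def compute_saved(speedup):
    """Fraction of baseline compute saved when `speedup`x less compute is needed."""
    return 1.0 - 1.0 / speedup


# Ratios as quoted in the article.
reported = [
    ("T5 setup, 500 million parameters", 4.2),
    ("GPT-3 XL-like setup, 1.9 billion parameters", 3.0),
]

for label, speedup in reported:
    print(f"{label}: {compute_saved(speedup):.1%} of baseline compute saved")
```

In other words, reaching the same quality with 4.2x less compute means spending a little under a quarter of the baseline budget, and 3x less compute means spending a third of it.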

Primer’s improvements can be attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. The researchers said that these modifications are simple and can be dropped into existing transformer codebases to obtain significant gains for auto-regressive modelling. They call the model with just these two modifications Primer-EZ.
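The two modifications are small enough to sketch in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation: the function names are hypothetical, and the kernel width of 3 and the causal (left-only) padding are assumptions made here to suit an auto-regressive model.

```python
def squared_relu(x):
    """Apply ReLU, then square the result: max(x, 0) ** 2."""
    return max(x, 0.0) ** 2


def causal_depthwise_conv(seq, kernels):
    """Depthwise 1-D convolution along the sequence axis.

    seq:     list of T positions, each a list of C channel values
             (for example, the output of a Q, K or V projection).
    kernels: list of C kernels, one per channel. kernels[c][k] weights
             the value k steps back in the sequence.
    Each channel is filtered on its own; channels are never mixed.
    Positions before the start of the sequence count as zero.
    """
    out = []
    for t in range(len(seq)):
        row = []
        for c, kernel in enumerate(kernels):
            acc = 0.0
            for k, weight in enumerate(kernel):
                if t - k >= 0:  # causal: only current and past positions
                    acc += weight * seq[t - k][c]
            row.append(acc)
        out.append(row)
    return out


# Toy projection output: 3 positions, 2 channels.
q = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
# Assumed width-3 kernels: channel 0 sums its window, channel 1 halves itself.
kernels = [[1.0, 1.0, 1.0], [0.5, 0.0, 0.0]]

print(causal_depthwise_conv(q, kernels))
print([squared_relu(v) for v in (-2.0, 0.0, 3.0)])
```

In a real codebase the same idea would typically be expressed with the framework's built-in depthwise convolution and activation functions rather than explicit loops.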

The image shown below depicts the modifications. The blue indicates portions of the original transformer, and red signifies one of their proposed modifications. 

[Image: the Primer modifications to the transformer. Source: arXiv]


Primer looks promising, but the study has limitations. Firstly, the model parameter sweeps are approximately an order of magnitude smaller than those performed in the original study by Kaplan et al. (Scaling Laws for Neural Language Models). Although the large-scale models use a significant amount of compute, they are still orders of magnitude smaller than SOTA models, such as the full-scale GPT-3 (175 billion parameters).

Another limitation is that the study focuses primarily on decoder-only models, while encoder-only (BERT, XLNet and RoBERTa) and encoder-decoder sequence models are still widely used. The researchers perform encoder-decoder masked language modelling comparisons in T5 but do not study the results in significant depth. The main finding is that, although the Primer modifications improve upon vanilla transformers, they perform only as well as Transformer++.

In other words, this result suggests that architectural modifications that work well for decoder-only auto-regressive language models may not necessarily be as effective for encoder-based masked models. Google researchers said that developing an architecture that works well for masked language models is a topic of future research. 

The Future of Transformers, reducing cost 

Google researchers believe that, in practice, additional tuning could further improve Primer’s performance. With this study, the team hopes to encourage more research into the development of efficient transformers. For example, an important finding of this study is that small changes to activation functions can result in more efficient training.

“In the effort to reduce the cost of transformers, more investment in the development of such simple changes could be a promising area for future exploration,” said the researchers. 

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
