Google Introduces New Architecture To Reduce Cost Of Transformers

Primer’s improvements can be attributed to two simple modifications -- squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention.
Over the last few years, transformers such as BERT, XLNet, RoBERTa and GPT-3 have driven many advances in natural language processing (NLP). While these models perform significantly better with scale, training large models has become increasingly expensive.

For instance, training an XLNet (BERT alternative) model costs about $245,000. This estimate is based on the resource breakdown provided in the paper, where the researchers trained XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimiser, linear learning rate decay, and a batch size of 2048, which took about 2.5 days. GPT-3, with 175 billion parameters and a memory footprint of more than 350 GB, is estimated to cost close to $12 million to train, 200x the price of GPT-2.
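As a rough sanity check on these figures, the arithmetic can be sketched in a few lines of Python. The rate of $8 per TPU v3 chip-hour and the 2 bytes per parameter are assumptions chosen for illustration; neither is stated in the article, and real pricing and memory use will differ.

```python
# Back-of-the-envelope check of the XLNet and GPT-3 figures.
# ASSUMPTION: $8 per TPU v3 chip-hour (illustrative rate, not from the article).
chips = 512
days = 2.5
rate_usd_per_chip_hour = 8.0

chip_hours = chips * days * 24                      # total chip-hours used
xlnet_cost = chip_hours * rate_usd_per_chip_hour    # estimated cost in USD

# ASSUMPTION: 2 bytes per parameter (16-bit weights), weights only.
gpt3_params = 175e9
bytes_per_param = 2
gpt3_weight_gb = gpt3_params * bytes_per_param / 1e9

print(f"XLNet-Large: {chip_hours:,.0f} chip-hours, about ${xlnet_cost:,.0f}")
print(f"GPT-3 weights: about {gpt3_weight_gb:,.0f} GB")
```

Under these assumptions the result lands close to the $245,000 and 350 GB quoted, which suggests how such estimates are typically built: hardware count times wall-clock time times an hourly rate.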

To give a ballpark estimate, the paper ‘The cost of training NLP models’ puts the cost of training a model with:

  • 110 million parameters at $2.5K to $50K
  • 340 million parameters at $10K to $200K
  • 1.5 billion parameters at $80K to $1.6 million

Introducing Primer 

To reduce the training cost of transformer language models, Google researchers searched for more efficient alternatives to the transformer by modifying its TensorFlow computation graph. The search identified Primer (PRIMitives searched transformER), an architecture with a smaller training cost than the original transformer and other variants for auto-regressive language modelling.

Further, the experiments show that Primer offers three benefits:

  • Achieving a target quality at a smaller training cost
  • Achieving higher quality at a fixed training cost
  • Achieving a target quality at a smaller inference cost

These benefits are robust and hold across model sizes (20 million to 1.9 billion parameters), compute scales (10 to 10^5 accelerator hours), datasets (LM1B, C4, PG19), hardware platforms (TPUv2, TPUv3, TPUv4 and V100), multiple transformer codebases using default configurations (Tensor2Tensor, T5 and Lingvo), and multiple model families (sparse mixture-of-experts Switch Transformers, dense transformers, and Synthesizers).

The researchers also found that the compute savings of Primer over transformers increase as training cost grows when controlling for model size and quality. When using optimally sized models, these savings follow a power law with respect to quality.

The source code for Primer is available on GitHub.

Here’s how it reduces the cost 

To demonstrate Primer’s savings in an established training setup, the researchers compared a 500 million parameter Primer to the original T5 architecture, using the exact configuration of Raffel et al. applied to auto-regressive language modelling. Primer achieved an improvement of 0.9 perplexity at the same training cost and reached quality parity with the T5 baseline model using 4.2x less compute.

Further, the researchers demonstrated that Primer’s savings transfer to one-shot evaluations by comparing Primer to the transformer at 1.9 billion parameters in a setup similar to GPT-3 XL. Here, Primer achieved similar performance to the transformer on both pretraining perplexity and downstream one-shot tasks, using 3x less training compute. 
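To put the "Nx less compute" figures in perspective, the short sketch below converts a reduction factor into the fraction of baseline compute saved. The function name `compute_saving` is hypothetical and used only for this illustration; the factors 4.2 and 3 are the ones quoted above.

```python
def compute_saving(reduction_factor):
    """Fraction of baseline compute saved when only 1/reduction_factor of it is needed."""
    return 1.0 - 1.0 / reduction_factor


# Reduction factors quoted in the article for the two setups.
for name, factor in [("T5 setup", 4.2), ("GPT-3 XL-like setup", 3.0)]:
    print(f"{name}: {factor}x less compute = {compute_saving(factor):.1%} saved")
```

In other words, a 4.2x reduction corresponds to saving roughly three quarters of the baseline training compute, and a 3x reduction to saving about two thirds.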

Primer’s improvements can be attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. The researchers said that these modifications are simple and can be dropped into existing transformer codebases to obtain significant gains for auto-regressive modelling. They call the model with just these two modifications Primer-EZ.
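The two modifications can be illustrated with a minimal, dependency-free Python sketch. This is not the researchers' implementation: the function names, the kernel width of 3, and the causal (look-back only) padding are assumptions made here for illustration, and a real codebase would use its framework's tensor operations instead of nested lists.

```python
def squared_relu(x):
    """Squared ReLU: max(v, 0) ** 2, applied elementwise to a list of floats."""
    return [max(v, 0.0) ** 2 for v in x]


def causal_depthwise_conv(seq, kernels):
    """Depthwise convolution along the sequence axis.

    seq:     list of T positions, each a list of C channel values
    kernels: list of C kernels; kernels[c][j] weights the value j steps back
    Each channel is filtered by its own kernel (no mixing across channels),
    and position t only sees positions <= t (ASSUMPTION: causal padding).
    """
    out = []
    for t in range(len(seq)):
        row = []
        for c, kernel in enumerate(kernels):
            acc = 0.0
            for j, w in enumerate(kernel):
                if t - j >= 0:  # skip taps that would reach before the sequence start
                    acc += w * seq[t - j][c]
            row.append(acc)
        out.append(row)
    return out


# Toy stand-in for the output of a Q projection: 3 positions, 2 channels.
q = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# One width-3 kernel per channel: identity for channel 0, 2-step average for channel 1.
kernels = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]

q_conv = causal_depthwise_conv(q, kernels)
activated = squared_relu([-2.0, 0.0, 3.0])
print(q_conv)
print(activated)
```

In a transformer, the same convolution would be applied separately to the Q, K, and V projections before attention, and the squared ReLU would replace the activation in the feed-forward block.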

The image shown below depicts the modifications. The blue indicates portions of the original transformer, and red signifies one of their proposed modifications. 

[Figure: Primer’s modifications to the transformer. Source: arXiv]


Primer looks promising, but it has limitations. Firstly, the model parameter sweeps are approximately an order of magnitude smaller than those performed in the original study by Kaplan et al. (Scaling Laws for Neural Language Models). Although the large-scale models use a significant amount of compute, they are still orders of magnitude smaller than SOTA models, such as the full-scale GPT-3 (175 billion parameters).

Another limitation is that the study focuses primarily on decoder-only models, while encoder-only (BERT, XLNet and RoBERTa) and encoder-decoder sequence models are still widely used. The researchers perform encoder-decoder masked language modelling comparisons in T5 but do not study the results in significant depth. The main finding is that, although the Primer modifications improve upon vanilla transformers, they perform only as well as Transformer++.

In other words, this result suggests that architectural modifications that work well for decoder-only auto-regressive language models may not necessarily be as effective for encoder-based masked models. Google researchers said that developing an architecture that works well for masked language models is a topic of future research. 

The Future of Transformers, reducing cost 

Google researchers believe that, in practice, additional tuning could further improve Primer’s performance. With this study, the team looks to encourage more research into the development of efficient transformers. For example, an important finding of this study is that small changes to activation functions can result in more efficient training.

“In the effort to reduce the cost of transformers, more investment in the development of such simple changes could be a promising area for future exploration,” said the researchers. 

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
