Over the last few years, transformers such as BERT, XLNet, RoBERTa and GPT-3 have driven many advances in natural language processing (NLP). While these models perform significantly better at scale, training large models has become increasingly expensive.
For instance, training an XLNet (a BERT alternative) model is estimated to cost about $245,000. This estimate is based on the resource breakdown provided in the paper, where the researchers trained XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimiser, linear learning rate decay and a batch size of 2048, which took about 2.5 days. GPT-3, with 175 billion parameters, which exceeds 350 GB in memory, is estimated to cost close to $12 million to train, 200x the price of GPT-2.
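As a rough sanity check, the XLNet figures quoted above can be turned into an implied hourly rate. The sketch below only rearranges the numbers from the article; the resulting rate is a derived value, not an official cloud price, and the variable names are illustrative.

```python
# Figures quoted in the article for XLNet-Large.
chips = 512                # TPU v3 chips
days = 2.5                 # approximate wall-clock training time
quoted_cost_usd = 245_000  # estimated total training cost

# Total accelerator time, then the per-chip-hour rate the estimate implies.
chip_hours = chips * days * 24
implied_rate = quoted_cost_usd / chip_hours

print(f"{chip_hours:,.0f} chip-hours")
print(f"${implied_rate:.2f} per chip-hour (implied, not an official price)")
```

This works out to 30,720 chip-hours, or just under $8 per chip-hour.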
To give you a ballpark estimate, in a paper titled ‘The cost of training NLP models,’ the researchers noted that for:
- 110 million parameters: the cost ranges from $2.5K to $50K
- 340 million parameters: the cost ranges from $10K to $200K
- 1.5 billion parameters: the cost ranges from $80K to $1.6 million
To reduce the training cost of transformer language models, Google researchers searched for a more efficient alternative to the transformer by modifying its TensorFlow computation graph. The search identified Primer (PRIMitives searched transformER), an architecture with a smaller training cost than the original transformer and other variants for auto-regressive language modelling.
Further, the experiments show that Primer offers the following benefits:
- Achieving a target quality at a smaller training cost
- Achieving higher quality given a fixed training cost
- Achieving a target quality at a smaller inference cost
These benefits are substantial and hold across model sizes (20 million to 1.9 billion parameters), compute scales (10 to 10^5 accelerator hours), datasets (LM1B, C4, PG19), hardware platforms (TPUv2, TPUv3, TPUv4 and V100), multiple transformer codebases using default configurations (Tensor2Tensor, T5 and Lingvo), and multiple model families (sparse mixture-of-experts Switch Transformers, dense transformers and synthesisers).
The researchers also found that the compute savings of Primer over transformers increase as training cost grows, when controlling for model size and quality. When using optimally sized models, these savings follow a power law with respect to quality.
The source code for Primer is available on GitHub.
Here’s how it reduces the cost
In a bid to demonstrate Primer’s savings in an established training setup, the researchers compared a 500 million parameter Primer to the original T5 architecture, using the exact configuration used by Raffel et al. applied to auto-regressive language modelling. Primer achieved an improvement of 0.9 perplexity given the same training cost, and reached quality parity with the T5 baseline model using 4.2x less compute.
Further, the researchers demonstrated that Primer’s savings transfer to one-shot evaluations by comparing Primer to the transformer at 1.9 billion parameters in a setup similar to GPT-3 XL. Here, Primer achieved similar performance to the transformer on both pretraining perplexity and downstream one-shot tasks, using 3x less training compute.
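To put the two reported savings factors on a common footing, the short sketch below converts "Nx less compute" into the share of baseline compute that remains. It assumes the factor means baseline compute divided by N; the function name is illustrative and not taken from the paper.

```python
def fraction_of_baseline(savings_factor: float) -> float:
    """Compute needed relative to the baseline, reading 'Nx less' as baseline / N."""
    return 1.0 / savings_factor

# Savings factors reported in the article.
setups = [
    ("T5 setup, 500M parameters", 4.2),
    ("GPT-3 XL-like setup, 1.9B parameters", 3.0),
]

for label, factor in setups:
    frac = fraction_of_baseline(factor)
    print(f"{label}: {frac:.0%} of baseline compute, {1 - frac:.0%} saved")
```

Under that reading, Primer needs roughly a quarter of the baseline compute in the T5 comparison and about a third in the 1.9 billion parameter comparison.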
Primer’s improvements can be attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K and V projection in self-attention. The researchers said that these modifications are simple and can be dropped into existing transformer codebases to obtain significant gains for auto-regressive modelling. They call the model with just these two modifications Primer-EZ.
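The two modifications are small enough to sketch in plain Python. This is an illustrative toy, not the researchers' implementation: the kernel width of 3, the causal (left-only) padding, the kernel values and the function names are all assumptions made for the example, and a real model would apply these operations to learned tensors inside a deep learning framework.

```python
from typing import List


def squared_relu(x: float) -> float:
    """Squared ReLU: max(x, 0) ** 2."""
    return max(x, 0.0) ** 2


def causal_depthwise_conv(seq: List[List[float]],
                          kernels: List[List[float]]) -> List[List[float]]:
    """Depthwise convolution along the sequence axis.

    seq:     [seq_len][channels] values, e.g. a Q, K or V projection.
    kernels: [channels][width] weights; each channel has its own filter and
             channels are never mixed, which is what makes it depthwise.
    Causal left-padding is an assumption suited to auto-regressive models.
    """
    seq_len, channels = len(seq), len(seq[0])
    out = [[0.0] * channels for _ in range(seq_len)]
    for t in range(seq_len):
        for c in range(channels):
            width = len(kernels[c])
            for k in range(width):
                src = t - (width - 1) + k  # current and earlier positions only
                if src >= 0:
                    out[t][c] += kernels[c][k] * seq[src][c]
    return out


# Toy projection: 4 positions, 2 channels.
q = [[1.0, -1.0], [2.0, 0.5], [0.0, 3.0], [-1.0, 1.0]]
# Hypothetical kernels of width 3; real ones would be learned.
kernels = [[0.25, 0.25, 0.5], [0.0, 0.0, 1.0]]

q_conv = causal_depthwise_conv(q, kernels)
activations = [squared_relu(v) for v in (-2.0, 0.0, 1.5)]

print(q_conv)
print(activations)
```

The second kernel passes its channel through unchanged, which makes it easy to see that each channel is filtered independently of the others.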
The image below depicts the modifications. Blue indicates portions of the original transformer, and red signifies one of the proposed modifications.
Primer looks promising, but it has limitations. Firstly, the model parameter sweeps are approximately an order of magnitude smaller than those performed in the original study by Kaplan et al. (Scaling Laws for Neural Language Models). Although the large-scale models use a significant amount of compute, they are still orders of magnitude smaller than SOTA models, such as the full-scale GPT-3 (175 billion parameters).
Another limitation is that the researchers focus primarily on decoder-only models, while encoder-only (BERT, XLNet and RoBERTa) and encoder-decoder sequence models are still widely used. In this study, they perform encoder-decoder masked language modelling comparisons in T5 but do not study the results in significant depth. The main finding is that, although the Primer modifications improve upon vanilla transformers, they perform only as well as Transformer++.
In other words, this result suggests that architectural modifications that work well for decoder-only auto-regressive language models may not necessarily be as effective for encoder-based masked models. Google researchers said that developing an architecture that works well for masked language models is a topic of future research.
The Future of Transformers, reducing cost
Google researchers believe that, in practice, additional tuning could further improve Primer’s performance. With this study, the team looks to encourage more research into the development of efficient transformers. For example, an important finding of this study is that small changes to activation functions can result in more efficient training.
“In the effort to reduce the cost of transformers, more investment in the development of such simple changes could be a promising area for future exploration,” said the researchers.
Amit Raja Naik is a senior writer at Analytics India Magazine, where he dives deep into the latest technology innovations. He is also a professional bass player.