
Google Transformer Model Reformer Works On A Single GPU & Is Memory Efficient


Since its introduction in 2017, the Transformer has become popular among researchers and practitioners in machine learning and deep learning. The model is used for various natural language processing (NLP) tasks such as language understanding and machine translation, among others. In one of our articles, we discussed why transformers play such a crucial role in NLP development.

Massive transformer models can achieve state-of-the-art results on a number of tasks. However, training these models, especially on long sequences, is a costly affair and requires massive amounts of computation. Moreover, when such large models are trained with model parallelism across many accelerators, they cannot even be fine-tuned on a single GPU.

To overcome the issues of cost and massive computation, researchers at tech giant Google and UC Berkeley recently introduced a new transformer model known as Reformer. The paper on Reformer has been accepted as a conference paper at the International Conference on Learning Representations (ICLR 2020).

Behind the Model

The core idea behind the Transformer is self-attention, the ability to attend to different positions of the input sequence in order to compute a representation of that sequence. In a traditional Transformer, the memory consumed by a model with N layers is N times larger than in a single-layer model, because the activations of every layer need to be stored for back-propagation. In addition, attention over sequences of length L is O(L²) in both computational and memory complexity, which can exhaust the accelerator's memory.
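To see where that O(L²) term comes from, here is a minimal NumPy sketch of standard scaled dot-product attention. It is an illustration rather than the authors' implementation, and the sizes are made up; the point is that the score matrix has one entry per query-key pair, so it grows quadratically with the sequence length.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (L, d). Returns an (L, d) array."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                         # (L, L) matrix -- the O(L^2) bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                     # (L, d)

L, d = 4096, 64                                            # illustrative sizes, not from the paper
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = dot_product_attention(Q, K, V)                       # allocates a 4096 x 4096 score matrix
```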

The researchers used the following techniques while building the Reformer in order to make the model more memory-efficient and faster:

  • Reversible layers are used so that only a single copy of the activations needs to be stored for the whole model, making the N factor disappear. In other words, activations are stored once during training rather than N times, where N is the number of layers. Swapping in these reversible residuals for the standard ones does change the model, but it has a negligible effect on training in all configurations (a sketch of the idea follows this list).
  • Splitting the activations inside the feed-forward layers and processing them in chunks removes the d_ff factor and saves memory inside the feed-forward layers. This chunking only affects the implementation and is numerically identical to the layers used in the standard Transformer (see the second sketch below).
  • Approximate attention based on locality-sensitive hashing replaces the O(L²) factor in the attention layers with O(L log L), which allows operating on long sequences. In simple words, the researchers replaced dot-product attention with one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence (see the third sketch below).
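To make the first point concrete, below is a minimal NumPy sketch of a reversible residual block in the spirit of RevNets, which the Reformer builds on. The functions F and G are toy stand-ins for the attention and feed-forward sub-layers, and the sizes are illustrative; the only claim is that the inputs can be reconstructed from the outputs, so per-layer activations need not be stored for back-propagation.

```python
import numpy as np

def rev_block_forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_block_inverse(y1, y2, F, G):
    x2 = y2 - G(y1)           # recover x2 from the outputs alone
    x1 = y1 - F(x2)           # then recover x1
    return x1, x2

# toy sub-layers standing in for attention and feed-forward
F = lambda x: np.tanh(x)
G = lambda x: 0.5 * x

x1, x2 = np.random.randn(8, 16), np.random.randn(8, 16)
y1, y2 = rev_block_forward(x1, x2, F, G)
r1, r2 = rev_block_inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)        # activations are recoverable
```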
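The second technique relies on the fact that the feed-forward layer acts on each sequence position independently, so the sequence can be pushed through it in chunks and the full (L, d_ff) intermediate activation never has to exist in memory at once. A minimal sketch, with illustrative names and sizes rather than the authors' code:

```python
import numpy as np

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2                    # position-wise ReLU feed-forward

def chunked_feed_forward(x, W1, W2, chunk_size=128):
    outputs = [feed_forward(x[i:i + chunk_size], W1, W2)   # only chunk_size positions in flight
               for i in range(0, x.shape[0], chunk_size)]
    return np.concatenate(outputs, axis=0)

L, d_model, d_ff = 1024, 64, 256                           # illustrative sizes
x = np.random.randn(L, d_model)
W1 = np.random.randn(d_model, d_ff)
W2 = np.random.randn(d_ff, d_model)
# chunking changes only how the work is scheduled, not the result
assert np.allclose(feed_forward(x, W1, W2), chunked_feed_forward(x, W1, W2))
```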
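The third technique hashes queries and keys so that each position only attends within its own bucket instead of over all L positions. The sketch below shows just the random-projection bucketing step that locality-sensitive hashing relies on; the full LSH attention in the paper (sorting, chunking and multi-round hashing) is more involved.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """x: (L, d) vectors. Returns an integer bucket id per position.

    Nearby vectors tend to land in the same bucket, so attention can be
    restricted to positions that share a bucket id.
    """
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))  # random projection
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 64))                          # illustrative sizes
buckets = lsh_buckets(x, n_buckets=16, rng=rng)
# each query would then attend only to keys whose bucket id matches its own
```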

The researchers further trained Reformer models with up to 20 layers on the enwik8 and imagenet64 datasets in order to verify that large models can indeed fit on a single accelerator and train quickly on long sequences. The Reformer combines the modelling capacity of a Transformer with an architecture that can be executed efficiently on long sequences and with small memory use, even for models with a large number of layers.

Applications

The ability to handle long sequences opens the way for using the Reformer on many complex generative tasks. In addition to generating very long coherent text, the Reformer can bring the power of transformer models to other domains such as time-series forecasting, music, and image and video generation.

Wrapping Up

Using reversible residuals instead of the standard residuals not only makes the Reformer faster but also more memory-efficient than other transformer models, all while running on a single GPU. The motive behind this model is to help large, richly parameterised transformer models become more widespread and accessible.

Read the paper here.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.