Google Transformer Model Reformer Works On A Single GPU & Is Memory Efficient

Since their inception in 2017, transformer models have become popular among researchers and academia in the machine and deep learning sector. These deep learning models are used for various natural language processing (NLP) tasks such as language understanding and machine translation, among others. In one of our articles, we discussed why transformers play such a crucial role in NLP development.

Massive transformer models can achieve state-of-the-art results on a number of tasks. However, training these models, especially on long sequences, is costly and computationally intensive. When trained with model parallelism, these large models cannot even be fine-tuned on a single GPU.



To overcome the cost and heavy computation, researchers at tech giant Google and UC Berkeley recently introduced a new transformer model known as Reformer. The paper on Reformer was accepted as a conference paper at the International Conference on Learning Representations (ICLR 2020).

Behind the Model

The core idea behind this transformer model is self-attention: the ability to attend to different positions of the input sequence to compute a representation of that sequence. In a traditional transformer model, the memory in a model with N layers is N times larger than in a single-layer model, because activations need to be stored for back-propagation. Moreover, attention on sequences of length L is O(L²) in both computational and memory complexity, which can exhaust the accelerator memory.
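The quadratic cost is easy to see in code. Below is a minimal NumPy sketch of standard dot-product attention (an illustration, not the authors' implementation); the intermediate scores matrix has shape L × L, which is exactly the O(L²) term Reformer targets.

```python
import numpy as np

def dot_product_attention(q, k, v):
    """Standard scaled dot-product attention.

    The scores matrix has shape (L, L), so memory and compute
    grow quadratically with the sequence length L.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (L, L): the O(L^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v

L, d = 1024, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
out = dot_product_attention(q, k, v)
print(out.shape)  # (1024, 64) -- but the intermediate scores were 1024 x 1024
```

For L = 64K tokens, that intermediate matrix alone would hold over four billion entries, which is why long sequences exhaust accelerator memory.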

The researchers used the following methods while building the Reformer model to improve its efficiency and make it faster.

  • Reversible layers enable storing only a single copy of activations in the whole model, so the N factor disappears: activations are stored only once during training instead of N times, where N is the number of layers. Swapping these reversible residuals in for the standard ones changes the model slightly but has a negligible effect on training in all configurations.
  • Splitting activations inside feed-forward layers and processing them in chunks removes the d_ff factor and saves memory inside the feed-forward layers. Splitting activations affects only the implementation; it is numerically identical to the layers used in the standard Transformer.
  • Approximate attention based on locality-sensitive hashing allows operating on long sequences: the researchers replaced dot-product attention with one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence.
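The reversible-layer idea from the first point above can be sketched in a few lines. This is a toy NumPy illustration, with `f` and `g` as hypothetical stand-ins for the attention and feed-forward sub-layers; the point is that the block's inputs can be reconstructed exactly from its outputs, so activations need not be stored for back-propagation.

```python
import numpy as np

def f(x):   # stand-in for the attention sub-layer
    return np.tanh(x)

def g(x):   # stand-in for the feed-forward sub-layer
    return 0.5 * x

def rev_forward(x1, x2):
    """Reversible residual block: y1 = x1 + f(x2); y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2):
    """Recover the inputs from the outputs, so activations need not be kept."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

x1, x2 = np.random.default_rng(1).standard_normal((2, 4, 8))
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # True
```

Because the inverse is exact, back-propagation can recompute each layer's activations on the fly from the layer above, trading a little extra compute for an N-fold memory saving.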
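The hashing step behind the LSH attention point can be illustrated with random projections. This is a simplified angular-LSH sketch, not the paper's exact scheme: vectors pointing in similar directions tend to land in the same bucket, so attention can be restricted to pairs within a bucket instead of all L² pairs.

```python
import numpy as np

def lsh_buckets(vectors, n_buckets, seed=0):
    """Assign each vector to one of n_buckets via random projections.

    Projecting onto random directions and taking the argmax over
    [Rx; -Rx] tends to give nearby (similar-direction) vectors the
    same bucket, which is the property LSH attention relies on.
    """
    rng = np.random.default_rng(seed)
    d = vectors.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))            # random projection
    projected = vectors @ R
    return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

rng = np.random.default_rng(2)
x = rng.standard_normal(64)
near = x + 0.01 * rng.standard_normal(64)   # a near-duplicate query
print(lsh_buckets(np.stack([x, near]), n_buckets=8))
```

Nearby queries usually share a bucket, so each query only attends within its bucket; sorting by bucket and chunking is what brings the cost down to O(L log L).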

The researchers further trained models with up to 20-layer big Reformers on the enwik8 and imagenet64 datasets to verify that large models can indeed fit on a single accelerator and train fast on long sequences. Reformer combines the modelling capacity of a Transformer with an architecture that can be executed efficiently on long sequences, with small memory use even for models with a large number of layers.


The ability to handle long sequences opens the way for using Reformer on many complex generative tasks. In addition to generating very long coherent text, the Reformer model can bring the power of transformers to other domains such as time-series forecasting and music, image and video generation.

Wrapping Up

Using reversible residuals instead of standard residuals not only allows Reformer to run faster but also makes it more memory-efficient than other transformer models while using only a single GPU. The motive behind this model is to help large, richly parameterised transformer models become more widespread and accessible.

Read the paper here.


Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
