Since their introduction in 2017, transformer models have become popular among researchers in machine learning and deep learning. These deep learning models are used for various natural language processing (NLP) tasks such as language understanding and machine translation. In one of our earlier articles, we discussed why transformers play such a crucial role in NLP development.
Massive transformer models can achieve state-of-the-art results on a number of tasks. However, training these models, especially on long sequences, is costly and computationally intensive. Large models that are trained with model parallelism cannot even be fine-tuned on a single GPU.
To address the cost and the computational demands, researchers at Google and UC Berkeley recently introduced a new transformer model known as Reformer. The paper on Reformer has been accepted as a conference paper at the International Conference on Learning Representations (ICLR 2020).
Behind the Model
The core idea behind the transformer is self-attention: the ability to attend to different positions of the input sequence to compute a representation of that sequence. In a traditional transformer, the memory used by a model with N layers is N times larger than in a single-layer model, because the activations of every layer must be stored for back-propagation. In addition, attention over a sequence of length L is O(L^2) in both computational and memory complexity, which can exhaust accelerator memory.
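To give a feel for the quadratic term, the following is a rough back-of-the-envelope sketch in Python. It assumes 32-bit floats, a single attention head and a batch size of one; the function name `attention_matrix_bytes` and the chosen sequence lengths are illustrative and do not come from the paper's code.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    # Full dot-product attention scores form an L x L matrix:
    # one value for every (query position, key position) pair.
    # Assumption: 4 bytes per value (float32), one head, batch size 1.
    return seq_len * seq_len * bytes_per_value


for seq_len in (1024, 16384, 65536):
    gib = attention_matrix_bytes(seq_len) / 2**30
    print(f"L = {seq_len:>6}: {gib:.4f} GiB for one attention matrix")
```

Doubling the sequence length quadruples this figure, which is why long sequences run out of accelerator memory well before the model's parameters become the bottleneck.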
The researchers used the following techniques to make the Reformer model more memory-efficient and faster on long sequences.
- Reversible layers make it possible to store only a single copy of activations for the whole model, so the N factor disappears. Because each layer's inputs can be recomputed from its outputs during back-propagation, activations are stored once during training instead of N times, where N is the number of layers. Using these reversible residuals instead of the standard ones does change the model, but it has a negligible effect on training in all configurations.
- Splitting activations inside feed-forward layers and processing them in chunks removes the d_ff factor (the intermediate dimension of the feed-forward layer) and saves memory inside feed-forward layers. Splitting activations only affects the implementation and is numerically identical to the layers used in the Transformer.
- Approximate attention computation based on locality-sensitive hashing (LSH) allows the model to operate on long sequences. In simple terms, the researchers replaced dot-product attention with one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence.
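The three techniques above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: `f` and `g` are hypothetical stand-ins for the attention and feed-forward sub-layers, all function names and sizes are made up for the example, and the hashing step only shows how vectors are assigned to buckets by a random projection, not the full attention computation within buckets.

```python
import numpy as np


def f(x, w):
    # Hypothetical stand-in for the attention sub-layer.
    return np.tanh(x @ w)


def g(x, w):
    # Hypothetical stand-in for the feed-forward sub-layer.
    return np.maximum(x @ w, 0.0)


def reversible_forward(x1, x2, wf, wg):
    # Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = x1 + f(x2, wf)
    y2 = x2 + g(y1, wg)
    return y1, y2


def reversible_inverse(y1, y2, wf, wg):
    # Recompute the inputs from the outputs, so the inputs
    # do not have to be kept in memory for back-propagation.
    x2 = y2 - g(y1, wg)
    x1 = y1 - f(x2, wf)
    return x1, x2


def feed_forward(x, w1, w2):
    # Position-wise feed-forward layer: every row (position) is independent.
    return np.maximum(x @ w1, 0.0) @ w2


def chunked_feed_forward(x, w1, w2, n_chunks):
    # Process positions chunk by chunk, so only one chunk's
    # d_ff-wide intermediate activation exists at a time.
    chunks = np.array_split(x, n_chunks, axis=0)
    return np.concatenate([feed_forward(c, w1, w2) for c in chunks], axis=0)


def lsh_buckets(x, n_buckets, rng):
    # Angular LSH with a random projection R: bucket = argmax([xR; -xR]).
    # n_buckets is assumed to be even.
    r = rng.standard_normal((x.shape[-1], n_buckets // 2))
    proj = x @ r
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 16, 8, 32  # illustrative sizes only
x1 = rng.standard_normal((seq_len, d_model))
x2 = rng.standard_normal((seq_len, d_model))
wf = rng.standard_normal((d_model, d_model))
wg = rng.standard_normal((d_model, d_model))
w1 = rng.standard_normal((d_model, d_ff))
w2 = rng.standard_normal((d_ff, d_model))

y1, y2 = reversible_forward(x1, x2, wf, wg)
r1, r2 = reversible_inverse(y1, y2, wf, wg)
print("inputs recovered:", np.allclose(r1, x1) and np.allclose(r2, x2))

full = feed_forward(x1, w1, w2)
chunked = chunked_feed_forward(x1, w1, w2, n_chunks=4)
print("chunked matches full:", np.allclose(full, chunked))

print("bucket ids:", lsh_buckets(x1, n_buckets=4, rng=rng))
```

In the sketch, positions that land in the same bucket would be the ones allowed to attend to each other, which is where the saving over comparing every pair of positions comes from.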
The researchers further trained Reformer models with up to 20 layers on the enwik8 and imagenet64 datasets to verify that large models can indeed fit on a single core and train fast on long sequences. Reformer combines the modelling capacity of a transformer with an architecture that can be executed efficiently on long sequences and with small memory use, even for models with a large number of layers.
The ability to handle long sequences opens the way for the use of the Reformer on many complex generative tasks. In addition to generating very long coherent text, the Reformer model can bring the power of transformer models to other domains such as time-series forecasting and music, image and video generation.
By using reversible residuals instead of standard residuals, Reformer not only runs faster but is also more memory-efficient than other transformer models while using only a single GPU. The aim of the model is to help large, richly-parameterised transformer models become more widespread and accessible.
Read the paper here.