
Meet Linformer: The First Ever Linear-Time Transformer Architecture By Facebook


Recently, researchers from Facebook AI introduced Linformer, a Transformer architecture that is more memory- and time-efficient than the standard Transformer. According to the researchers, Linformer is the first theoretically proven linear-time Transformer architecture.

For a few years now, the number of parameters in Natural Language Processing (NLP) Transformers has grown drastically, from the 340 million parameters of the original BERT-Large to the 175 billion parameters of GPT-3. Transformer models have been applied to a wide variety of tasks such as machine translation, question answering and text classification.

Why Linformer?

Training and deploying large Transformer models can be prohibitively costly for long sequences. The key efficiency bottleneck in standard Transformer models is the model’s self-attention mechanism. 

In the self-attention mechanism, the representation of each token is updated by attending to all the other tokens in the previous layer. This operation is essential for retaining long-term information, and it gives Transformers their edge over recurrent models on long sequences. However, attending to all tokens at each layer incurs a complexity of O(n²) in time and memory with respect to the sequence length n.
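
To make the bottleneck concrete, here is a minimal NumPy sketch of standard scaled dot-product self-attention, given purely as an illustration rather than Facebook AI's implementation; the n × n score matrix is what drives the quadratic cost, and the shapes and names are chosen only for this example.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_self_attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)     # (n, n) score matrix: O(n^2) time and memory
    weights = softmax(scores, axis=-1)
    return weights @ V                # (n, d) updated token representations

n, d = 512, 64                        # sequence length and head dimension (illustrative)
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(standard_self_attention(Q, K, V).shape)   # (512, 64)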

This is why the researchers introduced the Linformer architecture and demonstrated that the self-attention matrix can be approximated by a low-rank matrix. The Linformer architecture reduces the overall self-attention complexity from O(n²) to O(n) in both time and space.

The Tech Behind Linformer

Linformer tackles the self-attention bottleneck with a new self-attention mechanism that reduces self-attention to an O(n) operation in both space and time complexity. It allows the contextual mapping to be computed in linear time and memory with respect to the sequence length by projecting the keys and values down to a fixed, smaller dimension k.
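
A minimal sketch of that idea, again for illustration rather than the authors' exact implementation: two projection matrices (learned in the real model, random here) compress the keys and values along the sequence axis from length n down to a fixed k, so the score matrix is only n × k.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_self_attention(Q, K, V, E, F):
    # Low-rank self-attention: K and V are projected from length n down to k.
    d = Q.shape[1]
    K_proj = E @ K                      # (k, n) @ (n, d) -> (k, d)
    V_proj = F @ V                      # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)  # (n, k): linear in n for a fixed k
    return softmax(scores, axis=-1) @ V_proj    # (n, d)

n, d, k = 512, 64, 128                  # k = 128, as in the experiments described below
Q, K, V = (np.random.randn(n, d) for _ in range(3))
E = np.random.randn(k, n) / np.sqrt(n)  # learned in the real model; random for the sketch
F = np.random.randn(k, n) / np.sqrt(n)
print(linformer_self_attention(Q, K, V, E, F).shape)   # (512, 64)

Because k stays fixed while n grows, both the score matrix and the projected keys and values scale linearly with the sequence length.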

The researchers first compared the pretraining performance of the proposed architecture against RoBERTa, a popular Transformer model by Facebook AI. For pretraining, they used the BookCorpus dataset along with English Wikipedia, which together contain about 3,300 million words. All models are pretrained with the masked-language-modelling (MLM) objective, and training for all experiments is parallelised across 64 Tesla V100 GPUs with 250k updates.

Furthermore, the researchers fine-tuned the Linformer architecture on popular datasets like IMDB and SST-2 (sentiment classification), as well as QNLI (natural language inference) and QQP (textual similarity).

The researchers observed that the Linformer model with n = 512 and k = 128 has downstream performance comparable to the popular RoBERTa model, and that it even slightly outperforms RoBERTa at k = 256. Moreover, although the layerwise sharing strategy shares a single projection matrix across the entire model, it achieves the best accuracy of the three parameter-sharing strategies.

They also noted that a Linformer pretrained with a longer sequence length achieves results similar to one pretrained with a shorter length. This empirically supports the notion that the performance of the Linformer model is largely determined by the projected dimension k rather than the ratio n/k.

Applications of Linformer

Currently, Facebook AI Research (FAIR) lab is using Linformer to analyse billions of pieces of content on Facebook as well as on Instagram in different regions around the globe. In a blog post, developers stated that this Transformer architecture had assisted them in making steady progress in catching hate speech and content that incites violence. 

The researchers developed the Linformer architecture with the aim of unlocking the capabilities of powerful AI models such as RoBERTa, BERT and XLM-R, and making it possible to use them efficiently at scale.

According to the researchers, the main reason to use Linformer is that, with standard Transformers, the required processing power rises quadratically as the input length increases. With Linformer, the number of computations grows only linearly, which makes it possible to train models on larger pieces of text and thus achieve better performance.
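
A rough back-of-the-envelope comparison (assuming a single attention head, a single layer and a fixed projected dimension k = 128) illustrates how quickly the two costs diverge as the input gets longer.

k = 128                        # fixed projected dimension (illustrative)
for n in (512, 2048, 8192):
    quadratic = n * n          # entries in the standard n x n attention matrix
    linear = n * k             # entries in the Linformer n x k attention matrix
    print(f"n={n:5d}  standard={quadratic:>12,}  linformer={linear:>10,}  savings={quadratic // linear}x")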

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.