Why Are Transformers Increasingly Becoming As Important As RNNs And CNNs?

Google AI unveiled a new neural network architecture called the Transformer in 2017. The Google AI team claimed the Transformer outperformed leading approaches such as recurrent neural networks and convolutional models on translation benchmarks.

In four years, the Transformer has become the talk of the town. A big part of the credit goes to its self-attention mechanism, which helps models focus on the most relevant parts of the input and reason more effectively. BERT and GPT-3 are among the most popular Transformer-based models.

Now, the looming question is: with Transformer adoption on the rise, could it become as popular as, or even surpass, RNNs and CNNs?

What Is A Transformer?

As per the original 2017 paper, titled ‘Attention Is All You Need’, the Transformer processes the entire input sequence simultaneously. Like other sequence-to-sequence models, it transforms one sequence into another, but it relies on the attention mechanism rather than recurrence to do so.

In NLP models, the attention mechanism considers the relationships between words, irrespective of where they are placed in a sentence. The Transformer performs a small, constant number of empirically chosen steps. At each step, it applies self-attention, which directly models the relationships between all the words in the sentence. To compute the representation of a given word, the Transformer compares it with every other word in the sentence.

Attention Mechanism for English To French Text Translation (Credit: Trungtran.io)
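
To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention in the spirit of the original paper. The function names, dimensions and random values are purely illustrative; real Transformers use multiple attention heads and learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project each word into query/key/value space
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every word is compared with every other word
    weights = softmax(scores, axis=-1)         # attention weights for each word sum to 1
    return weights @ V                         # each output mixes information from all words

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # toy sequence: 4 "words", 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (4, 8): one context-aware vector per word
```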

Comparison With Other Methods

Before the introduction of the Transformer, most state-of-the-art NLP models were based on RNNs. An RNN processes data sequentially, word by word, so the representation of each word depends on the words processed before it. This makes RNNs inefficient at handling long sequences: the model tends to forget the contents of distant positions or, in some cases, mixes up the contents of adjacent positions, and the more steps there are, the harder it becomes for the recurrent network to make decisions. The sequential nature of RNNs also makes it difficult to take full advantage of modern fast computing devices such as TPUs and GPUs.

Long Short-Term Memory (LSTM) networks offer an improvement over conventional RNNs. An LSTM uses a gating mechanism to determine which information the cell needs to remember and which to forget, which also mitigates the vanishing gradient problem that RNNs suffer from. LSTMs are good, but not good enough: like RNNs, they cannot be trained in parallel across time steps.
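
For illustration, here is a rough NumPy sketch of a single LSTM step showing the gating idea and why the computation is inherently sequential. The weight shapes and values are illustrative, not a production implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: gates decide what to forget, what to write and what to expose."""
    z = W @ x + U @ h_prev + b                    # forget, input, output and candidate, stacked
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gate activations lie in (0, 1)
    c = f * c_prev + i * np.tanh(g)               # keep some old memory, write some new content
    h = o * np.tanh(c)                            # expose part of the cell as the hidden state
    return h, c

hidden, inp = 4, 3                                # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, inp))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, inp)):               # each step depends on the previous hidden state,
    h, c = lstm_step(x, h, c, W, U, b)            # so the sequence cannot be processed in parallel
```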

The Multilayer Perceptron (MLP) is a basic neural network that was highly popular in the 1980s. However, it has long been outclassed for any heavy lifting by networks such as CNNs and RNNs.

Convolutional Neural Networks have an advantage over RNNs (and LSTMs) because they are easy to parallelise. CNNs find wide application in NLP as they are fast to train and effective with shorter sentences, capturing dependencies among nearby combinations of words. However, in long sentences, capturing the dependencies among all the different combinations of words becomes cumbersome and impractical.

The Transformer avoids recurrence by processing sentences as a whole, using attention mechanisms and positional embeddings. Newer models such as Transformer-XL overcome the fixed input size limitation as well.
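
As an illustration of the positional-embedding idea, here is a small NumPy sketch of the sinusoidal positional encodings proposed in ‘Attention Is All You Need’. The sequence length and model dimension are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, as proposed in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                         # word positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]                           # embedding dimensions 0 .. d_model-1
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                      # odd dimensions use cosine
    return pe

# These vectors are added to the word embeddings so the model knows word order
# even though it processes the whole sentence at once.
print(positional_encoding(seq_len=10, d_model=16).shape)      # (10, 16)
```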

Transformer Use Cases

GPT-3: Generative Pretrained Transformer-3 (GPT-3) was one of the most significant breakthroughs of 2020. GPT-3 is the third-generation language prediction model in the GPT-n series from OpenAI. With 175 billion machine learning parameters, GPT-3 broke the record of Microsoft’s Turing NLG, which, at 17 billion parameters, had been the largest language model until then.

GPT-2: GPT-2, released in 2019, is a large transformer-based language model with 1.5 billion parameters, more than ten times as many as the original GPT model. GPT-2 is trained on a dataset of 8 million web pages to ‘predict the next word, given all of the previous words within some text’.

BERT: In 2018, Google open-sourced an NLP pre-training technique called Bidirectional Encoder Representations from Transformers (BERT). It builds on previous work such as semi-supervised sequence learning, ELMo, ULMFiT, and Generative Pre-Training. BERT achieved state-of-the-art results on a range of NLP tasks.

Interestingly, NLP startup Hugging Face has a library called Transformers. It provides state-of-the-art general-purpose architectures for Natural Language Understanding and Natural Language Generation with deep interoperability between TensorFlow 2.0 and PyTorch.
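
For example, a couple of lines with the library’s pipeline API are enough to run a pretrained Transformer. This assumes the transformers package and a backend such as PyTorch are installed; models are downloaded on first use, and the outputs shown in the comments are only indicative.

```python
from transformers import pipeline

# Sentiment analysis with a default pretrained model
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers have changed the NLP landscape."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# Text generation with GPT-2
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are important because", max_length=30)[0]["generated_text"])
```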

Wrapping Up

Since 2017, researchers have proposed many modifications to the Transformer. However, a recent study from Google Research found that most of these modifications do not meaningfully improve its performance, which is also why most of them have not seen widespread adoption.

