Why Transformers Are Increasingly Becoming As Important As RNN And CNN?

Google AI unveiled a new neural network architecture called the Transformer in 2017. The Google AI team claimed the Transformer outperformed leading approaches such as recurrent neural networks and convolutional models on translation benchmarks.

In four years, the Transformer has become the talk of the town. A big part of the credit goes to its self-attention mechanism, which helps models focus on the most relevant parts of the input and reason more effectively. BERT and GPT-3 are among the most popular Transformer-based models.

Now, the looming question is: With Transformer adoption on the rise, could it surpass or become as popular as RNN and CNN?


What Is A Transformer?

As per the original 2017 paper, titled ‘Attention Is All You Need’, the Transformer perceives the entire input sequence simultaneously. Like other sequence-to-sequence models, it transforms one sequence into another, but it relies on the attention mechanism rather than recurrence to do so.

In NLP models, the attention mechanism considers the relationship between words, irrespective of where they are placed in a sentence. A Transformer performs a small, constant number of empirically chosen steps. At each step, it uses the self-attention mechanism to model the relationships between all the words in the sentence. To compute the representation for a given word, the Transformer compares it with every other word in the sentence and uses the resulting attention scores to weight each word's contribution.
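
The comparison of each word with every other word can be illustrated with a minimal sketch of scaled dot-product self-attention. It is a simplification under stated assumptions: the three word vectors are made-up toy values, and the queries, keys and values are the raw vectors themselves, whereas a real Transformer learns separate projection matrices and runs several attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = the input vectors.

    Assumption: no learned projections and a single head, for illustration only.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this word with every word in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted average of all word vectors.
        output = [
            sum(w * vec[j] for w, vec in zip(weights, vectors))
            for j in range(d)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 4-dimensional embeddings for a three-word sentence.
embeddings = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence. Because every row is computed independently of the others, all of them can be evaluated at the same time.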

Attention Mechanism for English To French Text Translation (Credit: Trungtran.io)

Comparison With Other Methods

Before the introduction of the Transformer, most state-of-the-art NLP models were based on RNNs. An RNN processes data sequentially, word by word, passing a hidden state from each step to the next. RNNs are not very efficient at handling long sequences. The model tends to forget the contents of distant positions or, in some cases, mixes up the contents of adjacent positions: the more steps there are, the harder it is for the recurrent network to make decisions. The sequential nature of RNNs also makes it difficult to take full advantage of modern fast computing devices such as TPUs and GPUs.
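
A toy recurrence makes both problems visible. The sketch below assumes a scalar Elman-style update with hand-picked weights (`w_in` and `w_rec` are hypothetical values, not learned ones); a real RNN uses weight matrices and vector states, but the step-by-step dependence is the same.

```python
import math

def rnn_forward(inputs, w_in=0.5, w_rec=0.9):
    """Run the scalar recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    states = []
    for x in inputs:
        # Each step needs the previous hidden state, so the loop
        # cannot be split across parallel workers.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def first_word_influence(length):
    """How much the final state changes when only the first input changes."""
    with_word = rnn_forward([1.0] + [0.0] * (length - 1))[-1]
    without_word = rnn_forward([0.0] * length)[-1]
    return abs(with_word - without_word)

for length in (5, 20, 50):
    print(length, first_word_influence(length))
```

In this toy setting the influence of the first word on the final state shrinks as the sequence grows, which is the forgetting behaviour described above.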

Long Short-Term Memory (LSTM) offers an improvement over the conventional RNN. LSTM uses a gating mechanism to determine which information the cell needs to remember and which to forget. It also mitigates the vanishing gradient problem that RNNs suffer from. LSTM is good but not good enough. Like an RNN, an LSTM processes a sequence step by step, so its computation cannot be parallelised across the sequence.
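
The gating idea can be sketched with a single scalar LSTM step. This is an illustration under assumptions: the parameter dictionaries `remember` and `forget` hold hand-picked hypothetical values chosen to push the forget gate towards 1 or 0, whereas a trained LSTM learns its weights and operates on vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps hypothetical parameter names to values."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c

base = dict(wf=0.0, uf=0.0, wi=0.0, ui=0.0, bi=-10.0,
            wo=0.0, uo=0.0, bo=10.0, wg=1.0, ug=0.0, bg=0.0)
remember = dict(base, bf=10.0)   # forget gate close to 1: keep the memory
forget = dict(base, bf=-10.0)    # forget gate close to 0: erase the memory

_, c_kept = lstm_step(x=0.3, h_prev=0.0, c_prev=0.8, p=remember)
_, c_erased = lstm_step(x=0.3, h_prev=0.0, c_prev=0.8, p=forget)
print(c_kept, c_erased)
```

With the forget gate near 1 the cell state stays close to its previous value of 0.8, and with the gate near 0 it is almost wiped out. Note that the step still takes `h_prev` and `c_prev` as inputs, so the sequence must be processed in order.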

The multilayer perceptron (MLP) is a basic neural network that was highly popular in the 1980s. However, it has since been superseded for any heavy lifting by networks such as CNNs and RNNs.

Convolutional neural networks (CNNs) have an advantage over RNNs (and LSTMs) as they are easy to parallelise. CNNs find wide application in NLP as they are fast to train and are effective with shorter sentences. A convolution captures dependencies among the words that fall within the same kernel window. However, in long sentences, capturing the dependencies among all the possible combinations of words can be cumbersome and impractical, since it requires many stacked layers or very large kernels.
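
A rough piece of arithmetic shows why long sentences are awkward for convolutions. The sketch assumes plain stacked 1-D convolutions with stride 1 and no dilation (dilated or strided designs grow their reach faster), and the helper names are illustrative.

```python
import math

def receptive_field(num_layers, kernel_size):
    """Words visible to one output position after stacking plain convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(span_length, kernel_size):
    """Fewest stacked layers whose receptive field covers span_length words."""
    return math.ceil((span_length - 1) / (kernel_size - 1))

for span in (5, 50, 500):
    print(span, "words apart needs", layers_needed(span, kernel_size=3), "layers")
```

Under these assumptions the number of layers grows linearly with the distance between the two words, whereas a single self-attention layer relates every pair of words directly.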

The Transformer avoids recurrence by processing sentences as a whole using attention mechanisms and positional embeddings. Newer models such as Transformer-XL can overcome the fixed input size limitation as well.
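
Since attention by itself ignores word order, position information has to be added to the word vectors. The sketch below follows the sinusoidal positional encoding described in the original paper; the function names and the toy word vectors are illustrative assumptions, and many later models learn their positional embeddings instead.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal table: sine on even dimensions, cosine on odd dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k + 1) shares one wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(word_vectors, table):
    """Add the position vector to each word vector, element by element."""
    return [
        [w + p for w, p in zip(word, position)]
        for word, position in zip(word_vectors, table)
    ]

# Hypothetical sentence in which the same word vector appears twice.
words = [[0.2, 0.1, 0.0, 0.3], [0.5, 0.5, 0.5, 0.5], [0.2, 0.1, 0.0, 0.3]]
table = positional_encoding(num_positions=3, d_model=4)
encoded = add_positions(words, table)
print(encoded[0])
print(encoded[2])
```

The first and third words start with identical vectors but end up with different ones after the position is added, which lets the model tell them apart even though it reads the whole sentence at once.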

Transformer Use Cases

GPT-3: Generative Pre-trained Transformer 3 (GPT-3) was one of the most significant breakthroughs in 2020. GPT-3 is the third-generation language prediction model in the GPT-n series from OpenAI. With 175 billion machine learning parameters, GPT-3 broke the record of Microsoft’s Turing NLG, which was the largest language model (with 17 billion parameters) until then.

GPT-2: GPT-2, released in 2019, is a large Transformer-based language model with 1.5 billion parameters, more than ten times as many as the previous GPT model. GPT-2 was trained on a dataset of 8 million web pages to ‘predict the next word, given all of the previous words within some text’.

BERT: In 2018, Google open-sourced an NLP pre-training technique called Bidirectional Encoder Representations from Transformers (BERT). It built on previous work such as semi-supervised sequence learning, ELMo, ULMFiT, and Generative Pre-Training. BERT achieved state-of-the-art results on a range of NLP tasks.

Interestingly, NLP startup Hugging Face has a library called Transformers. It provides state-of-the-art general-purpose architectures for Natural Language Understanding and Natural Language Generation with deep interoperability between TensorFlow 2.0 and PyTorch.

Wrapping Up

Since 2017, researchers have introduced many modifications to the Transformer. However, a recent study from Google Research found that most of these modifications did not meaningfully improve its performance. This may also explain why most modifications to the Transformer have not seen widespread adoption.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
