Google AI unveiled a new neural network architecture called the Transformer in 2017. The Google AI team claimed the Transformer outperformed leading approaches such as recurrent neural networks (RNNs) and convolutional models on translation benchmarks.
Four years on, the Transformer has become the talk of the town. A big part of the credit goes to its self-attention mechanism, which helps models focus on the most relevant parts of the input and reason more effectively. BERT and GPT-3 are two popular Transformer-based models.
Now, the looming question is: With Transformer adoption on the rise, could it match or even surpass RNNs and CNNs in popularity?
What Is A Transformer?
As per the original 2017 paper, titled ‘Attention Is All You Need’, a Transformer perceives the entire input sequence simultaneously. Like other sequence-to-sequence models, it transforms one sequence into another, but it relies on the attention mechanism to do so.
In NLP models, the attention mechanism considers the relationship between words, irrespective of where they are placed in a sentence. A Transformer performs a small, constant number of steps, chosen empirically. At each step, it uses the self-attention mechanism to model the relationships between all the words in the sentence. To compute the representation for a given word, the Transformer compares it with every other word in the sentence.
Attention Mechanism for English To French Text Translation (Credit: Trungtran.io)
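The comparison of each word with every other word can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product self-attention, not the full architecture: it assumes the queries, keys and values are the word vectors themselves (a real Transformer learns separate projections for each), and the function names and toy numbers are made up for the example.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention over a list of word vectors.

    Simplifying assumption: queries, keys and values are the input
    vectors themselves, with no learned projection matrices.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this word with every word in the sentence (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: a weighted average of all the word vectors.
        output = [sum(w * v[i] for w, v in zip(weights, vectors))
                  for i in range(d)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional "embeddings" for a three-word sentence (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sentence)
```

Each row of `weights` says how much one word attends to every word in the sentence, regardless of how far apart they are.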
Comparison With Other Methods
Before the introduction of the Transformer, most state-of-the-art NLP models were based on RNNs. An RNN processes data sequentially, word by word, so reaching the last word means passing through every word before it. RNNs are not very efficient at handling long sequences. The model tends to forget the contents of distant positions or, in some cases, mixes up the contents of adjacent positions: the more steps there are, the harder it is for the recurrent network to make decisions. The sequential nature of RNNs also makes it difficult to take full advantage of modern fast computing devices such as TPUs and GPUs.
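A toy recurrent network makes both problems visible. The sketch below assumes a single hidden unit with made-up scalar weights (`w_in` and `w_rec` are illustrative values, not taken from any real model): every step needs the previous step's result, and a signal seen at the first position fades as the sequence goes on.

```python
import math

def rnn_forward(inputs, w_in=0.5, w_rec=0.8):
    """Run a one-unit recurrent network over a sequence of numbers.

    w_in and w_rec are made-up scalar weights for illustration.
    """
    hidden = 0.0
    states = []
    for x in inputs:
        # Step t needs the hidden state from step t-1, so this loop
        # cannot be split across parallel workers.
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

# A signal at the first position followed by nine empty steps.
states = rnn_forward([1.0] + [0.0] * 9)
```

With these weights the hidden state shrinks at every step, so by the tenth word little of the first word's contribution is left.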
Long Short-Term Memory (LSTM) improves on the conventional RNN. An LSTM uses a gating mechanism to determine which information the cell needs to remember and which to forget. This also mitigates the vanishing gradient problem that RNNs suffer from. LSTM is good but not good enough: like an RNN, it still processes a sequence one step at a time, so training cannot be parallelised across the positions of a sequence.
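The gate mechanism can be illustrated with a single-unit LSTM cell. This is a sketch under simplifying assumptions: the weights are scalars held in a hypothetical dictionary `params`, and the values are hand-picked so that the forget gate stays almost fully open.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell with scalar weights in dict p."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    c = f * c_prev + i * g  # keep part of the old memory, add part of the new
    h = o * math.tanh(c)
    return h, c

# Hand-picked illustrative weights: a large forget-gate bias keeps the memory,
# a negative input-gate bias lets very little new information in.
params = {"wf": 0.0, "uf": 0.0, "bf": 5.0,
          "wi": 0.0, "ui": 0.0, "bi": -5.0,
          "wo": 0.0, "uo": 0.0, "bo": 0.0,
          "wg": 1.0, "ug": 0.0, "bg": 0.0}

h, c = 0.0, 1.0  # start with something stored in the cell
for x in [0.0] * 10:
    # Still a sequential loop: each step depends on the previous h and c.
    h, c = lstm_step(x, h, c, params)
```

After ten empty steps the cell state `c` is still close to its starting value, which is the behaviour the forget gate is meant to allow. The loop itself remains sequential, which is why the parallelism problem persists.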
The multilayer perceptron (MLP) is a basic neural network that was highly popular in the 1980s. However, for any heavy lifting it has since been superseded by networks such as CNNs and RNNs.
Convolutional neural networks (CNNs) have an advantage over RNNs (and LSTMs) in that they are easy to parallelise. CNNs find wide application in NLP as they are fast to train and are effective with shorter sentences, where they can capture dependencies among the possible combinations of words. However, in long sentences, capturing the dependencies among different combinations of words becomes cumbersome and impractical.
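One way to see why long sentences are awkward for CNNs is to count how many convolution layers must be stacked before a single output position can see two distant words. The helper below is a hypothetical illustration that assumes plain 1-D convolutions with stride 1 and no dilation; dilated convolutions would need fewer layers.

```python
def conv_layers_needed(distance, kernel_size=3):
    """Count the stacked stride-1, undilated 1-D convolution layers needed
    before one output position covers two words `distance` positions apart."""
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2")
    layers = 0
    receptive_field = 1  # a single word before any convolution
    while receptive_field < distance + 1:
        # Each extra layer widens the window by kernel_size - 1 words.
        receptive_field += kernel_size - 1
        layers += 1
    return layers

short_range = conv_layers_needed(2)   # neighbouring words
long_range = conv_layers_needed(50)   # words far apart in a long sentence
```

Under these assumptions the number of layers grows linearly with the distance between the two words, whereas self-attention relates any pair of words in a single step.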
The Transformer avoids recurrence by processing sentences as a whole using attention mechanisms and positional embeddings. Newer models such as Transformer-XL can also overcome the fixed input length limitation.
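Because the whole sentence is processed at once, word order has to be supplied explicitly. One possible scheme is the sinusoidal positional encoding described in ‘Attention Is All You Need’, sketched below; learned positional embeddings are another option. The function name and the sizes used here are illustrative.

```python
import math

def positional_encoding(num_positions, d_model):
    """Build sinusoidal position vectors: sine on even indices, cosine on odd."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            # Indices 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** (2 * (j // 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Four positions, eight dimensions (illustrative sizes).
pe = positional_encoding(4, 8)
```

Each row is added to the corresponding word embedding, so two identical words at different positions end up with different input vectors.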
Transformer Use Cases
GPT-3: Generative Pre-trained Transformer 3 (GPT-3) was one of the most significant breakthroughs of 2020. GPT-3 is the third-generation language prediction model in the GPT-n series from OpenAI. With 175 billion parameters, GPT-3 broke the record of Microsoft’s Turing NLG, which was the largest language model (with 17 billion parameters) until then.
GPT-2: GPT-2, released in 2019, is a large Transformer-based language model with 1.5 billion parameters, more than ten times as many as the previous GPT model. GPT-2 is trained on a dataset of 8 million web pages to ‘predict the next word, given all of the previous words within some text’.
BERT: In 2018, Google open-sourced an NLP pre-training technique called Bidirectional Encoder Representations from Transformers (BERT). It builds on previous work such as semi-supervised sequence learning, ELMo, ULMFiT, and Generative Pre-Training. BERT achieved state-of-the-art results on a range of NLP tasks.
Interestingly, NLP startup Hugging Face has a library called Transformers. It provides state-of-the-art general-purpose architectures for Natural Language Understanding and Natural Language Generation with deep interoperability between TensorFlow 2.0 and PyTorch.
Since 2017, researchers have introduced many modifications to the Transformer. However, a recent study from Google Research found that most of these modifications did not meaningfully improve its performance, which may explain why most of them have not seen widespread adoption.