Recent advances in Natural Language Processing (NLP) research have been dominated by the combination of Transfer Learning methods with large-scale Transformer language models.
Creating these general-purpose models remains an expensive and time-consuming process, which restricts the use of these methods to a small subset of the wider NLP community. Transformers brought a paradigm shift in NLP: the starting point for training a model on a downstream task moved from a blank, task-specific model to a general-purpose pre-trained architecture.
How Transformers Took Over From Other Architectures
In natural language understanding (NLU), one challenge is that words with the same spelling can carry different meanings. For example, the word ‘wound’ can indicate an injury or the act of having wrapped something up. Because a word's representation is computed as a weighted average over its context, the weights assigned to the surrounding words decide which sense is captured, so the same word can end up with different representations in different sentences.
A Transformer network applies a self-attention mechanism, which scans through every word and assigns attention scores (weights) to the words. The Transformer was introduced as a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
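The weighted-average idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the full Transformer layer: it assumes the queries, keys and values are the word vectors themselves (a real model first applies learned projections), and the two-dimensional embeddings for "the", "wound" and "healed" are made-up numbers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Simplifying assumption: no learned projection matrices are applied.
    Returns (attention weights, contextual vectors), one row per word.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Score this word against every word in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Contextual representation: attention-weighted average of all vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(dim)]
        )
    return weights, outputs

# Hypothetical embeddings for "the", "wound", "healed" (illustrative only).
toy_embeddings = [[0.1, 0.0], [1.0, 1.0], [0.9, 1.2]]
attention_weights, contextual = self_attention(toy_embeddings)
```

Each row of `attention_weights` sums to one, and with these toy numbers "wound" attends more strongly to "healed" than to "the", which is the kind of signal that lets the surrounding context select a sense.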
Neural networks usually process language by generating fixed- or variable-length vector-space representations. After starting with representations of individual words or even pieces of words, they aggregate information from surrounding words to determine the meaning of a given bit of language in context.
Though RNNs have in recent years become the typical network architecture for translation, processing language sequentially, their sequential nature makes it difficult to harness parallel processing units like TPUs fully. Convolutional neural networks (CNNs), on the other hand, though less sequential, take a relatively large number of steps to combine information.
In a Transformer-based model such as BERT, by contrast, the final hidden state corresponding to the first token of the input is taken as the representation of the whole sequence, and the probability of each label is calculated using a standard softmax function.
For question answering, a softmax over the token positions gives the probability that each token starts the answer span. The same formula is used for the end of the answer span, and the maximum-scoring span is used as the prediction.
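The two output heads described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the hidden states, the label weights and the start and end vectors are hypothetical toy values, and a span is scored as its start score plus its end score with the end not allowed to precede the start.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def label_probabilities(first_token_state, label_weights):
    """Softmax over one logit per label, computed from the first token's state."""
    return softmax([dot(row, first_token_state) for row in label_weights])

def best_span(token_states, start_vector, end_vector):
    """Return (start, end) indices of the maximum-scoring answer span."""
    start_scores = [dot(t, start_vector) for t in token_states]
    end_scores = [dot(t, end_vector) for t in token_states]
    best, best_score = None, -math.inf
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):  # the end may not precede the start
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best

# Toy final hidden states for three tokens (made-up numbers).
token_states = [[1.0, 0.0], [0.2, 0.1], [0.0, 1.0]]

# Classification: two labels, hypothetical weight rows.
classification_probs = label_probabilities(token_states[0], [[2.0, 0.0], [0.0, 2.0]])

# Span prediction: hypothetical start and end vectors.
span = best_span(token_states, start_vector=[2.0, 0.0], end_vector=[0.0, 2.0])
```

Applying `softmax` to the start scores or the end scores yields the per-position probabilities; ranking spans by the raw scores gives the same winner, because the softmax normaliser is constant across positions.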
Top NLP Models Using Transformers
- BERT, or Bidirectional Encoder Representations from Transformers, set new benchmarks for NLP when it was introduced by Google in late 2018. It is a method of pre-training language representations that obtained state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
- XLNet is another new unsupervised language representation learning method based on a novel generalised permutation language modelling objective. XLNet employed Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieved state-of-the-art (SOTA) results on various downstream language tasks, including question answering, natural language inference, sentiment analysis, and document ranking.
- Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for distilled BERT. It is a small, fast, cheap and light Transformer model based on the BERT architecture.
BERT itself has paved the way for newer models. Since state-of-the-art models are mostly based on BERT, and BERT is built on the Transformer architecture, we can safely assume that the Transformer model has taken the throne for natural language understanding.
This was made possible because the Transformer allowed for significantly more parallelisation and reached a new state of the art in translation quality.
Beyond computational performance and higher accuracy, the Transformer also makes it possible to visualise which other parts of a sentence the network attends to when processing or translating a given word, offering insight into how information travels through the network.
It Has Its Own Library Now
In what is exciting news for the machine learning community, and for developers in the NLP domain especially, the team at Hugging Face has released a library called Transformers.
This library now provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with 32+ pre-trained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
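As a rough sketch of what working with the library looks like, the function below loads a pre-trained model and returns one contextual vector per token. It assumes a recent release of the `transformers` package with PyTorch installed, and it assumes the `bert-base-uncased` checkpoint can be downloaded on first use; the helper name `embed_sentence` is hypothetical and not part of the library.

```python
def embed_sentence(text, model_name="bert-base-uncased"):
    """Return the final hidden state for each token of `text`.

    Assumptions: `transformers` and `torch` are installed and the named
    checkpoint is available for download. `embed_sentence` is an
    illustrative helper, not a library function.
    """
    # Imported lazily so this sketch can be loaded without the libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout

    # Tokenise and convert to PyTorch tensors.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # The first output is the final hidden state, shaped
    # (batch, tokens, hidden_size); drop the batch dimension.
    return outputs[0][0]
```

Calling `embed_sentence("The nurse cleaned the wound.")` would return a tensor with one row per token, including the special tokens the tokenizer adds. Swapping `model_name` for another checkpoint is how the same few lines can be reused across architectures.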
As NLP becomes a key aspect of AI, the democratisation of Transformers in the form of a library will open more doors for up-and-coming researchers. Because state-of-the-art pre-trained models like BERT and GPT-2 can be accessed without having to build them from scratch, entry-level practitioners can now focus on their target idea rather than reinventing the wheel.