In the past few years, we have seen tremendous improvements in the ability of machines to deal with Natural Language. We saw algorithms breaking the state-of-the-art one after the other on a variety of language-specific tasks, all thanks to transformers. In this article, we will discuss and implement transformers in the simplest way possible using a library called Simple Transformers.
The Seq2Seq Model
Before stepping into the transformers’ territory, let’s take a brief look at the Sequence-to-Sequence models.
The Sequence-to-Sequence model (seq2seq) converts an input sequence of text into an output sequence, and the two sequences may differ in length, which we can easily relate to machine translation. But seq2seq is not limited to translation; it also works well for other tasks that require text generation.
The model uses an encoder-decoder architecture and has been very successful in machine translation and question answering tasks. It uses a stack of Long Short-Term Memory (LSTM) networks or Gated Recurrent Units (GRU) in the encoder and the decoder.
Here is a simple demonstration of Seq2Seq model:
One major drawback of the Seq2Seq model comes from the limitation of its underlying RNNs. Though LSTMs are meant to deal with long term dependencies between the word vectors, the performance drops as the distance increases. The model also restricts parallelization.
The transformer model introduces an architecture that is based solely on the attention mechanism and uses no recurrent networks, yet produces results superior in quality to Seq2Seq models. It addresses the long-term dependency problem of the Seq2Seq model. The transformer architecture is also parallelizable, which makes training considerably faster.
Let’s take a look at some of the important features:
Encoder: The encoder has 6 identical layers in which each layer consists of a multi-head self-attention mechanism and a fully connected feed-forward network. The multi-head attention system and feed-forward network both have a residual connection and a normalization layer.
Decoder: The decoder also consists of 6 identical layers with an additional sublayer in each of the 6 layers. The additional sublayer performs multi-head attention over the output of the encoder stack.
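To make the "residual connection and a normalization layer" part concrete, here is a minimal sketch in plain Python of how one sublayer can be wrapped, computing LayerNorm(x + Sublayer(x)) for a single position. The names `layer_norm`, `residual_sublayer` and `toy_feed_forward` are made up for this illustration, the feed-forward function is only a stand-in, and the learned scale and shift parameters of a real normalization layer are left out.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (roughly) unit variance.
    Simplification: no learned scale/shift parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(x, sublayer):
    """Add the sublayer output back onto its input, then normalize."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

def toy_feed_forward(x):
    # Hypothetical stand-in for the feed-forward network: ReLU, then scaling
    return [0.5 * max(0.0, v) for v in x]

x = [1.0, -2.0, 3.0, 0.5]
y = residual_sublayer(x, toy_feed_forward)
print(y)
```

In an encoder layer this wrapper would be applied twice, once around self-attention and once around the feed-forward network; a decoder layer would apply it a third time around the attention over the encoder output.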
Attention is the mapping of a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The attention mechanism allows the model to understand the context of a text.
- Scaled Dot-Product Attention:
- Multi-Head Attention:
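Scaled dot-product attention is usually written as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the dimension of the keys. The sketch below implements that formula in plain Python so the steps are easy to follow. The multi-head part is deliberately simplified: a real transformer applies learned linear projections before and after the heads, while this illustration only slices each vector into equal chunks, so treat it as a rough picture rather than a faithful implementation. All function names are made up for this example.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of the query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_attention(queries, keys, values, num_heads):
    """Simplified multi-head attention: slice vectors into chunks, attend
    per chunk, concatenate. Learned projections are omitted (assumption)."""
    d_model = len(queries[0])
    if d_model % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    size = d_model // num_heads

    def chunk(vectors, h):
        return [v[h * size:(h + 1) * size] for v in vectors]

    heads = [scaled_dot_product_attention(chunk(queries, h),
                                          chunk(keys, h),
                                          chunk(values, h))
             for h in range(num_heads)]
    # Concatenate the head outputs for each query position
    return [[x for head in heads for x in head[i]]
            for i in range(len(queries))]
```

A query that closely matches one key pulls the output towards the value stored under that key; when all keys look the same to the query, the output is simply the average of the values.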
The transformer architecture is a breakthrough in NLP, giving rise to many state-of-the-art models such as Google’s BERT, Facebook’s RoBERTa, OpenAI’s GPT and many others.
Text Classification With Transformers
In this hands-on session, you will be introduced to the Simple Transformers library. The library is built on top of the popular Hugging Face Transformers library and provides implementations of various transformer-based models and algorithms.
The library makes it easy to implement various NLP tasks such as Sequence Classification, Token Classification (NER), and Question Answering.
So without further ado let’s get our hands dirty!
Introduction To Simple Transformers
The Simple Transformers library was built to make working with transformers as simple as possible, and it largely succeeds. Transformers can now be used with just a few lines of code. All credit goes to "Simple Transformers - Multi-Class Text Classification with BERT, RoBERTa, XLNet, XLM, and DistilBERT" and to the Hugging Face Transformers library.
Installing Simple Transformers
Type and execute the following command to install the simple transformers library.
!pip install simpletransformers
Creating A Classifier Model
from simpletransformers.classification import ClassificationModel
# Create a ClassificationModel (the label count is passed by keyword as num_labels)
model = ClassificationModel(model_type, model_name, num_labels=number_of_labels, use_cuda=boolean)
- model_type: This parameter can be one of ‘bert’, ‘xlnet’, ‘xlm’, ‘roberta’, ‘distilbert’
- model_name: All available model names can be found here.
- number_of_labels: The number of unique labels or classes in the problem.
- use_cuda: When set to True, the model runs on a GPU through CUDA.
The ClassificationModel also accepts a dict, args, whose entries control the values of hyperparameters. The default argument list is given below:
Training The Model
The train_model method trains the model. It accepts a dataframe containing the training texts and their labels.
The method also saves checkpoints of the model to the output path, if one is specified in the args dict.
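Putting the pieces together, a training run could look roughly like the sketch below. Treat it as an illustration under assumptions rather than the library's official recipe: the helper names (`make_training_rows`, `train_classifier`), the 'bert' / 'bert-base-cased' choice, the "text" and "labels" column names and the values in `train_args` are picked for this example, the values are not the library defaults, and the accepted args keys may differ between versions. The heavy imports sit inside the function so the data-preparation part can be tried without the libraries installed.

```python
def make_training_rows(texts, labels):
    """Pair each text with its integer label as [text, label] rows."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    return [[text, int(label)] for text, label in zip(texts, labels)]

# Illustrative hyperparameter overrides (assumed values, not the defaults)
train_args = {
    "output_dir": "outputs/",        # where checkpoints are written
    "num_train_epochs": 1,
    "train_batch_size": 8,
    "overwrite_output_dir": True,
}

def train_classifier(rows, num_labels, args, use_cuda=False):
    """Build a dataframe from the rows and train a classifier on it."""
    # Imported here so the helpers above run without these packages
    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    train_df = pd.DataFrame(rows, columns=["text", "labels"])
    model = ClassificationModel(
        "bert", "bert-base-cased",
        num_labels=num_labels, args=args, use_cuda=use_cuda,
    )
    model.train_model(train_df)
    return model

rows = make_training_rows(["great movie", "terrible plot"], [1, 0])
# model = train_classifier(rows, num_labels=2, args=train_args)
```

The last line is commented out because it downloads a pretrained model and starts training; uncomment it once simpletransformers and pandas are installed.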
Evaluating The Classifier
The eval_model method evaluates the model on a validation set and returns the metrics, the outputs of the model as well as the wrong predictions.
result, model_outputs, wrong_predictions = model.eval_model(validation_dataframe)
The predict method returns the predicted labels along with the raw outputs, which contain one value per class for each input.
predictions, raw_outputs = model.predict(['input sentence'])
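If you want something closer to probabilities than raw scores, the raw outputs can be passed through a softmax. The sketch below assumes each entry of raw_outputs is a list of unnormalized scores (logits), one per class; the numbers are made up for illustration and `interpret_raw_outputs` is a hypothetical helper, not part of the library.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpret_raw_outputs(raw_outputs):
    """Return (predicted_label, probability) for each row of scores."""
    results = []
    for logits in raw_outputs:
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        results.append((label, probs[label]))
    return results

# Made-up scores for two input sentences and two classes
example_raw_outputs = [[-1.2, 2.3], [0.8, -0.4]]
for label, prob in interpret_raw_outputs(example_raw_outputs):
    print(label, round(prob, 3))
```

The label chosen here is simply the class with the highest score, so under the assumption above it should agree with the predictions returned by the predict method.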