Transfer learning methods are largely responsible for the recent breakthroughs in Natural Language Processing (NLP). Using pre-trained models can give state-of-the-art results while sparing us the heavy computation needed to train large models from scratch. This post gives a brief overview of DistilBERT, a strong example of transfer learning on natural language tasks that combines a pre-trained model with knowledge distillation.
DistilBERT, "a distilled version of BERT: smaller, faster, cheaper and lighter", was developed by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf at HuggingFace. Because of its large size, BERT is difficult to put into production. Suppose we want to run such a model on a mobile phone: we need a lightweight yet effective model, and that is where DistilBERT comes into the picture. DistilBERT retains 97% of BERT's performance while having 40% fewer parameters. BERT-base has 110 million parameters and BERT-large has 340 million, which makes them hard to deal with. To address this, the distillation technique is used to reduce the size of these large models.
Knowledge Distillation
Knowledge distillation transfers knowledge from a teacher to a student. In this technique, a larger model or an ensemble of models (the teacher) is trained, and a smaller model (the student) is then trained to mimic it. Distillation aims to copy the teacher's "dark knowledge", that is, the information carried by the probabilities it assigns to the wrong classes: for example, a desk chair can be mistaken for an armchair, but it should not be mistaken for a mushroom. In a way, the concept is similar to label smoothing: it prevents the model from being too sure about its predictions.
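To make the idea of dark knowledge more concrete, here is a minimal sketch of a temperature-softened softmax and a soft-target cross-entropy between a teacher and a student. The logits, the temperature values, and the function names are assumptions chosen for illustration; this is not the training code used for DistilBERT.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer distributions."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's soft predictions against the teacher's soft targets."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical teacher logits for the classes: desk chair, armchair, mushroom
teacher_logits = [5.0, 3.0, -4.0]

hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
print("T=1:", [round(p, 3) for p in hard])
print("T=4:", [round(p, 3) for p in soft])

# A student that ranks the classes like the teacher gets a lower loss
# than one that confuses the desk chair with the mushroom.
good_student = [4.0, 2.5, -3.0]
bad_student = [4.0, -3.0, 2.5]
print(distillation_loss(teacher_logits, good_student))
print(distillation_loss(teacher_logits, bad_student))
```

Raising the temperature shifts probability mass toward "armchair" while "mushroom" stays unlikely, which is the kind of similarity information the student is meant to pick up.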
DistilBERT Architecture
Student architecture (DistilBERT): The general architecture is the same as BERT, except that the token-type embeddings and the pooler are removed and the number of layers is reduced by a factor of 2, which has a large impact on computational efficiency.
Student initialization: Finding the right initialization for the sub-network is important for its convergence during training. The student is therefore initialized from the teacher by taking one layer out of two.
Distillation: The model was distilled on very large batches using dynamic masking and without the next sentence prediction (NSP) objective. Here, masking refers to the masked language modeling objective, in which a word to be predicted is replaced with [MASK] and the model is trained to predict it from the rest of the sequence. NSP is BERT's second pre-training objective, in which the model predicts whether one sentence follows another.
Data and compute power: The model was trained on the concatenation of English Wikipedia and the Toronto Book Corpus [Zhu et al., 2015] on eight 16GB V100 GPUs for approximately 90 hours.
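The dynamic masking mentioned under Distillation can be sketched as follows: instead of fixing the masked positions once during preprocessing, a fresh mask is drawn every time a sequence is fed to the model. This is a simplified illustration, not the actual DistilBERT data pipeline. The function name and the whitespace tokenization are hypothetical, the 15% rate is the one used by BERT, and BERT's rule of sometimes keeping or randomly replacing a selected token is left out for brevity.

```python
import random

MASK_TOKEN = "[MASK]"

def dynamically_mask(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels hold the original token at masked positions, else None."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # the model is trained to predict this token
        else:
            masked.append(tok)
            labels.append(None)        # position is ignored by the loss
    return masked, labels

tokens = "the capital of france is paris".split()

# Calling the function again draws a new mask, so the same sentence
# can be masked differently in each epoch.
for epoch in range(3):
    masked, labels = dynamically_mask(tokens)
    print(epoch, masked)
```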
Experiment Results
General language understanding: DistilBERT retains 97% of BERT's performance with 40% fewer parameters. This is measured on the General Language Understanding Evaluation (GLUE) benchmark, a collection of 9 datasets for evaluating natural language understanding systems.
Downstream task benchmark: DistilBERT also does well on downstream tasks such as IMDb sentiment classification, where its accuracy is only 0.6 percentage points below BERT's while the model is 40% smaller.
Size and inference speed: DistilBERT has 40% fewer parameters than BERT and is 60% faster.
On-device computation: On an iPhone 7 Plus, the average inference time of a DistilBERT question-answering model is 71% faster than that of a BERT-base question-answering model.
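As a quick sanity check on the size figures above, the short calculation below applies the 40% reduction to BERT-base's roughly 110 million parameters. The variable names are illustrative and the counts are rounded.

```python
BERT_BASE_PARAMS = 110_000_000  # approximate parameter count of BERT-base
REDUCTION_PERCENT = 40          # reduction reported for DistilBERT

# Integer arithmetic avoids floating-point rounding in the result.
distilbert_params = BERT_BASE_PARAMS * (100 - REDUCTION_PERCENT) // 100
print(f"DistilBERT: about {distilbert_params / 1e6:.0f}M parameters")
```

This gives about 66 million parameters, in line with the size usually quoted for DistilBERT.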
Installation
Install the HuggingFace Transformers library from PyPI.
!pip install transformers
Demo of HuggingFace DistilBERT
You can import the DistilBERT model from transformers as shown below:
from transformers import DistilBertModel
A. Checking the configuration
from transformers import DistilBertModel, DistilBertConfig

# Initializing a DistilBERT configuration
configuration = DistilBertConfig()

# Initializing a model from the configuration
model = DistilBertModel(configuration)

# Accessing the model configuration
configuration = model.config
B. DistilBERT Tokenizer
Like the BERT tokenizer, the DistilBERT tokenizer runs end-to-end tokenization: punctuation splitting and WordPiece.
from transformers import DistilBertTokenizer
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
inputs
The output of the tokenizer will be:
{'input_ids': tensor([[ 101, 7592, 1010, 2026, 3899, 2003, 10140, 102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
where input_ids are the numerical representations of the tokens in the sequence that the DistilBERT model will use.
attention_mask indicates which tokens should be attended to (1) and which should be ignored (0).
You can also check DistilBertTokenizerFast.
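The attention mask in the single-sentence output above contains only ones because nothing had to be padded. The following library-free sketch shows what happens when two sequences of different lengths are batched together. The `pad_batch` helper is hypothetical and only mimics the idea behind the tokenizer's padding; the second sequence is a made-up shorter input, and the padding id of 0 matches the [PAD] token of the uncased BERT vocabulary.

```python
PAD_ID = 0  # id of the [PAD] token in the uncased BERT vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-id lists to a common length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 7592, 1010, 2026, 3899, 2003, 10140, 102],  # "Hello, my dog is cute"
    [101, 7592, 102],                                  # hypothetical shorter sequence
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

With the real tokenizer, a comparable result is obtained by passing a list of sentences together with padding=True.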
C. DistilBERT Model
from transformers import DistilBertTokenizer, DistilBertModel
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
Here, last_hidden_states is the sequence of hidden states at the output of the last layer of the model.
You can also check DistilBertForMaskedLM.
D. DistilBERT Masked Language Modeling
from transformers import DistilBertTokenizer, DistilBertForMaskedLM
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
where loss is the masked language modeling loss.
logits are the prediction scores of the language modeling head.
E. DistilBERT for Sequence Classification
This model adds a sequence classification/regression head (a linear layer on top of the pooled output) that can be used for regression or classification problems.
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1

outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
where loss is the classification loss.
logits are the classification scores before softmax.
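Since the logits are scores before softmax, they usually need one more step before they can be read as class probabilities. The sketch below does this in plain Python for a single example. The two logit values are made up for illustration; in practice they would come from outputs.logits, and the classification head would need to be fine-tuned before its predictions mean anything.

```python
import math

def softmax(logits):
    """Turn raw classification scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-0.2, 1.3]  # hypothetical scores for a two-class problem
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```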
F. DistilBERT for Multiple Choice
from transformers import DistilBertTokenizer, DistilBertForMultipleChoice
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertForMultipleChoice.from_pretrained('distilbert-base-cased')

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."
labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

encoding = tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)
outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)  # batch size is 1

# the linear classifier still needs to be trained
loss = outputs.loss
logits = outputs.logits
G. DistilBERT for Token Classification
from transformers import DistilBertTokenizer, DistilBertForTokenClassification
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1] * inputs["input_ids"].size(1)).unsqueeze(0)  # Batch size 1

outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
H. DistilBERT For Question Answering
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, text, return_tensors='pt')
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])

outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
References