
Python Guide to HuggingFace DistilBERT – Smaller, Faster & Cheaper Distilled BERT


Transfer learning (TL) methods are primarily responsible for the recent breakthroughs in Natural Language Processing (NLP). They deliver state-of-the-art results by reusing pre-trained models, sparing us the heavy computation required to train large models from scratch. This post gives a brief overview of DistilBERT, a strong example of transfer learning on natural language tasks that combines a pre-trained model with knowledge distillation.

DistilBERT, a distilled version of BERT that is smaller, faster, cheaper and lighter, was developed by Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf at HuggingFace. BERT's large size makes it difficult to put into production. If we want to run such models on mobile phones, for example, we need a lightweight yet efficient model, and that is where DistilBERT comes into the picture. DistilBERT retains 97% of BERT's performance with about 40% fewer parameters. BERT-base has 110 million parameters and BERT-large has 340 million, which makes them hard to deploy. Knowledge distillation is the technique used to reduce the size of these large models.



Knowledge Distillation

Knowledge distillation transfers knowledge from a teacher to a student. A larger model (or an ensemble of models) is trained first, and a smaller model is then trained to mimic it. Distillation aims to copy the teacher's "dark knowledge", that is, the relative probabilities it assigns to incorrect classes: a desk chair can plausibly be mistaken for an armchair, but it should not be mistaken for a mushroom. The concept is similar to label smoothing in that it prevents the model from being too sure about its predictions.
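The soft-target idea can be sketched in a few lines of plain Python. This is a minimal illustration and not the DistilBERT training code: the three-class logits, the temperature value and the helper names (`softmax`, `distillation_loss`) are assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# Hypothetical logits over three classes: armchair, desk chair, mushroom
teacher_logits = [4.0, 3.0, -2.0]
hard_targets = softmax(teacher_logits, temperature=1.0)
soft_targets = softmax(teacher_logits, temperature=2.0)

# A student that also confuses armchair with desk chair matches the teacher better
# than one that confuses armchair with mushroom.
good_student = [3.5, 3.0, -1.5]
bad_student = [3.5, -1.5, 3.0]
loss_good = distillation_loss(teacher_logits, good_student)
loss_bad = distillation_loss(teacher_logits, bad_student)
print(soft_targets, loss_good, loss_bad)
```

Raising the temperature moves probability mass toward the plausible-but-wrong class, which is the signal the student learns from in addition to the correct label.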

DistilBERT Architecture

Student Architecture/DistilBERT Architecture: The general architecture is the same as BERT's, except that the token-type embeddings and the pooler are removed and the number of layers is reduced by a factor of 2, which has a large impact on computation efficiency.

Student initialization: It is important to find the right initialization for the sub-network to converge during training. Hence, the student is initialized from the teacher by taking one layer out of two.
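As a rough sketch of "taking one layer out of two", the snippet below copies every second layer of a stand-in 12-layer teacher into a 6-layer student. The dictionary layout and the helper `init_student_from_teacher` are hypothetical, and the choice of the even-indexed layers is an assumption; the text above does not say which layers are kept.

```python
# Stand-in for a 12-layer teacher: each "layer" is just a dict of named weights.
teacher_layers = [
    {"name": f"teacher_layer_{i}", "weights": [float(i)] * 4} for i in range(12)
]

def init_student_from_teacher(layers, step=2):
    """Copy every `step`-th teacher layer into a new, shallower student stack."""
    # Copy the weight lists so later student updates do not modify the teacher.
    return [dict(layer, weights=list(layer["weights"])) for layer in layers[::step]]

student_layers = init_student_from_teacher(teacher_layers)
kept = [layer["name"] for layer in student_layers]
print(len(student_layers), kept)
```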

Distillation: The model was distilled on very large batches using dynamic masking and without the next sentence prediction (NSP) objective. Masking here refers to masked language modeling, where a word to be predicted is replaced with the [MASK] token and the model is trained to predict that word from the rest of the sequence. NSP is the BERT pre-training objective of predicting whether one sentence follows another.
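A simplified sketch of dynamic masking follows. It assumes word-level tokens and replaces each selected token with [MASK] outright; BERT-style masking also sometimes keeps the token or swaps in a random one, which is left out here. The helper `dynamic_mask` and the 0.3 masking rate (raised from BERT's usual 0.15 so that the short example shows some masks) are illustrative assumptions.

```python
import random

MASK_TOKEN = "[MASK]"

def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels hold the original word at masked positions, else None."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # the model must predict this word
        else:
            masked.append(tok)
            labels.append(None)  # position is ignored by the loss
    return masked, labels

tokens = "the capital of france is paris".split()
rng = random.Random(0)
# Dynamic masking: a fresh mask pattern is drawn every time the sequence is seen,
# instead of fixing one pattern during preprocessing.
epochs = [dynamic_mask(tokens, mask_prob=0.3, rng=rng) for _ in range(3)]
for masked, _ in epochs:
    print(" ".join(masked))
```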

Data and compute power: The model was trained on the concatenation of English Wikipedia and the Toronto Book Corpus [Zhu et al., 2015] on eight 16GB V100 GPUs for approximately 90 hours.

Experiment Results

General Language Understanding: DistilBERT retains 97% of BERT's performance with 40% fewer parameters. This was measured on the General Language Understanding Evaluation (GLUE) benchmark, which contains 9 datasets for evaluating natural language understanding systems.

Downstream task benchmark: DistilBERT also performs well on downstream tasks such as IMDB sentiment classification, where its accuracy is only 0.6% lower than BERT's while the model is 40% smaller.

Size and inference speed: DistilBERT has 40% fewer parameters than BERT and is 60% faster.

On-device computation: On an iPhone 7 Plus, a DistilBERT question-answering model is on average 71% faster at inference than a BERT-base question-answering model.


Install HuggingFace Transformers framework via PyPI.

!pip install transformers

Demo of HuggingFace DistilBERT

You can import the DistilBERT model from transformers as shown below : 

from transformers import DistilBertModel

A. Checking the configuration

 from transformers import DistilBertModel, DistilBertConfig
 # Initializing a DistilBERT configuration
 configuration = DistilBertConfig()
 # Initializing a model from the configuration
 model = DistilBertModel(configuration)
 # Accessing the model configuration
 configuration = model.config 
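To connect the configuration to the model size, here is a back-of-the-envelope parameter count in plain Python. It assumes the default values of DistilBertConfig (vocab_size=30522, max_position_embeddings=512, dim=768, hidden_dim=3072, n_layers=6) and a standard Transformer encoder layout; the helper `linear` is only for this illustration.

```python
# Sizes assumed to match the defaults of DistilBertConfig()
vocab_size = 30522
max_position_embeddings = 512
dim = 768
hidden_dim = 3072
n_layers = 6

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * dim  # scale and shift vectors

# Word embeddings + position embeddings + embedding LayerNorm (no token-type embeddings)
embeddings = vocab_size * dim + max_position_embeddings * dim + layer_norm
# Query, key, value and output projections
attention = 4 * linear(dim, dim)
# Two-layer feed-forward network
feed_forward = linear(dim, hidden_dim) + linear(hidden_dim, dim)
per_layer = attention + feed_forward + 2 * layer_norm

total = embeddings + n_layers * per_layer
reduction_vs_bert_base = 1 - total / 110_000_000
print(f"{total:,} parameters, {reduction_vs_bert_base:.0%} fewer than BERT-base")
```

Under these assumptions the count comes to about 66 million, which lines up with the roughly 40% reduction relative to the 110 million parameters of BERT-base.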

B. DistilBERT Tokenizer 

Similar to the BERT tokenizer, DistilBertTokenizer performs end-to-end tokenization: punctuation splitting and WordPiece.

 from transformers import DistilBertTokenizer
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

The output of the tokenizer will be :

{'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

Here, input_ids is the numerical representation of the sequence that the DistilBERT model will use.

attention_mask indicates which tokens the model should attend to (1) and which it should ignore (0).
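The attention mask only becomes interesting when sequences of different lengths are batched together. The sketch below pads a batch by hand to show how the zeros arise; `pad_batch` is a hypothetical helper, not part of transformers, where the tokenizer does this for you when called with padding=True. The second sequence is a shortened copy of the first, made up for illustration.

```python
PAD_ID = 0  # id of [PAD] in the BERT uncased vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 7592, 1010, 2026, 3899, 2003, 10140, 102],  # "Hello, my dog is cute"
    [101, 7592, 102],                                  # shorter sequence
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```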

You can also check DistilBertTokenizerFast.

C. DistilBERT Model

 from transformers import DistilBertTokenizer, DistilBertModel
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertModel.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
 outputs = model(**inputs)
 last_hidden_states = outputs.last_hidden_state 

last_hidden_states holds the sequence of hidden states at the output of the last layer of the model.

You can also check DistilBertForMaskedLM.

D. DistilBERT Masked Language Modeling

 from transformers import DistilBertTokenizer, DistilBertForMaskedLM
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
 labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
 outputs = model(**inputs, labels=labels)
 loss = outputs.loss
 logits = outputs.logits 

Here, loss is the masked language modeling loss.

logits holds the prediction scores of the language modeling head, one score per vocabulary token at each position.
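To turn these prediction scores into a word, take the highest-scoring vocabulary entry at the [MASK] position. The sketch below uses a toy ten-word vocabulary and made-up scores for the masked position, so it runs without downloading a model; `fill_mask` is a hypothetical helper for this illustration.

```python
vocab = ["[CLS]", "[SEP]", "[MASK]", "the", "capital", "of", "france", "is", "paris", "london"]
tokens = ["[CLS]", "the", "capital", "of", "france", "is", "[MASK]", "[SEP]"]

# Hypothetical scores for the masked position only: one score per vocabulary entry
mask_logits = [-2.0, -3.0, -4.0, 0.1, 0.3, -1.0, 1.2, -0.5, 5.4, 3.9]

def fill_mask(tokens, mask_logits, vocab, mask_token="[MASK]"):
    """Replace the mask token with the vocabulary entry that has the highest score."""
    mask_index = tokens.index(mask_token)
    best_id = max(range(len(mask_logits)), key=mask_logits.__getitem__)  # argmax
    filled = list(tokens)
    filled[mask_index] = vocab[best_id]
    return filled

filled = fill_mask(tokens, mask_logits, vocab)
print(" ".join(filled))
```

With a real model the same step would be an argmax over the last dimension of logits at the index where input_ids equals the tokenizer's mask token id.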

E. DistilBERT for Sequence Classification

This model adds a sequence classification/regression head (a linear layer on top of the pooled output) that can be used for regression or classification problems.

 from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
 labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
 outputs = model(**inputs, labels=labels)
 loss = outputs.loss
 logits = outputs.logits 

Here, loss is the classification loss.

logits holds the classification scores before softmax.

F. DistilBERT for Multiple Choice

 from transformers import DistilBertTokenizer, DistilBertForMultipleChoice
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
 model = DistilBertForMultipleChoice.from_pretrained('distilbert-base-cased')
 prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
 choice0 = "It is eaten with a fork and a knife."
 choice1 = "It is eaten while held in the hand."
 labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1
 encoding = tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)
 outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels) # batch size is 1
 # the linear classifier still needs to be trained
 loss = outputs.loss
 logits = outputs.logits 

G. DistilBERT for Token Classification 

 from transformers import DistilBertTokenizer, DistilBertForTokenClassification
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
 labels = torch.tensor([1] * inputs["input_ids"].size(1)).unsqueeze(0)  # Batch size 1
 outputs = model(**inputs, labels=labels)
 loss = outputs.loss
 logits = outputs.logits 

H. DistilBERT For Question Answering

 from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
 question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
 inputs = tokenizer(question, text, return_tensors='pt')
 start_positions = torch.tensor([1])
 end_positions = torch.tensor([3])
 outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
 loss = outputs.loss
 start_scores = outputs.start_logits
 end_scores = outputs.end_logits 
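The start and end scores still have to be decoded into an answer span. One common approach, sketched here in plain Python, is to pick the (start, end) pair with the highest combined score subject to start <= end. The word-level tokens, the score values and the helper `best_span` are made up for illustration, and as with the other heads a model loaded from distilbert-base-uncased would need fine-tuning before its scores are meaningful.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) pair with the highest summed score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Simplified tokens for the question and context used above
tokens = ["[CLS]", "who", "was", "jim", "henson", "?", "[SEP]",
          "jim", "henson", "was", "a", "nice", "puppet", "[SEP]"]
# Hypothetical scores; note the high end score at index 8, which lies before the best start
start_scores = [0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.5, 0.3, 0.2, 4.0, 1.0, 0.5, 0.0]
end_scores = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 3.0, 0.0, 0.2, 0.8, 4.5, 0.1]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Taking the two argmaxes independently can yield an end position before the start position; searching over valid pairs avoids that.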



Aishwarya Verma
A data science enthusiast and a post-graduate in Big Data Analytics. Creative and organized with an analytical bent of mind.
