
Python Guide to HuggingFace DistilBERT – Smaller, Faster & Cheaper Distilled BERT


Transfer learning methods are largely responsible for the recent breakthroughs in Natural Language Processing (NLP). They deliver state-of-the-art results by reusing pre-trained models, sparing us the heavy computation needed to train large models from scratch. This post gives a brief overview of DistilBERT, a model that applies transfer learning with knowledge distillation to a pre-trained model and achieves outstanding performance on natural language tasks.

Developed by Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf at HuggingFace, DistilBERT is a distilled version of BERT: smaller, faster, cheaper and lighter. Because of its large size, BERT is difficult to put into production. Suppose we want to run these models on mobile phones: we then need a lighter yet still capable model, and that is where DistilBERT comes into the picture. DistilBERT retains 97% of BERT's performance while using about 40% fewer parameters. BERT-base has 110 million parameters and BERT-large has 340 million, which are hard to deploy. To solve this problem, the knowledge distillation technique is used to reduce the size of these large models.

Knowledge Distillation

Knowledge distillation transfers knowledge from a teacher model to a student model. In this technique, a larger model (or an ensemble of models) is trained first, and a smaller model is then trained to mimic it. Distillation copies the teacher's "dark knowledge": for example, a desk chair may reasonably be mistaken for an armchair, but it should never be mistaken for a mushroom, and the teacher's output distribution encodes these relative similarities. In that sense the idea is related to label smoothing: it prevents the model from being too confident about its predictions.
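
A simple way to make this concrete is the classic distillation loss: the student is trained to match the teacher's softened output distribution (soft targets) in addition to the usual hard-label loss. The snippet below is a minimal, generic sketch of such a loss; the temperature T, the weight alpha and the toy logits are illustrative values, not the exact recipe used to train DistilBERT.

 import torch
 import torch.nn.functional as F
 def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
     # Soft-target loss: KL divergence between softened student and teacher distributions
     soft_loss = F.kl_div(
         F.log_softmax(student_logits / T, dim=-1),
         F.softmax(teacher_logits / T, dim=-1),
         reduction="batchmean",
     ) * (T * T)
     # Hard-label loss: ordinary cross-entropy against the ground-truth labels
     hard_loss = F.cross_entropy(student_logits, labels)
     return alpha * soft_loss + (1 - alpha) * hard_loss
 # Toy example: a batch of 2 examples with 5 classes
 student_logits = torch.randn(2, 5)
 teacher_logits = torch.randn(2, 5)
 labels = torch.tensor([1, 3])
 print(distillation_loss(student_logits, teacher_logits, labels))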

DistilBERT Architecture

Student architecture (DistilBERT architecture): The general architecture is the same as BERT's, except that the token-type embeddings and the pooler are removed and the number of layers is reduced by a factor of 2, which greatly improves computational efficiency.

Student initialization: It is important to find the right initialization for the student sub-network so that it converges during training. Hence, the student is initialized from the teacher by taking one layer out of every two, as sketched below.
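
As a rough, generic PyTorch sketch of this "one layer out of two" initialization (the layer class and sizes below are illustrative; BERT's and DistilBERT's internal modules are named differently in the Transformers library):

 import copy
 import torch.nn as nn
 # Hypothetical 12-layer teacher built from identical transformer blocks
 teacher_layers = nn.ModuleList(
     [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True) for _ in range(12)]
 )
 # Initialize the 6-layer student from every other teacher layer (0, 2, 4, ...)
 student_layers = nn.ModuleList(
     [copy.deepcopy(teacher_layers[i]) for i in range(0, 12, 2)]
 )
 print(len(student_layers))  # 6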

Distillation: The model was distilled on very large batches using dynamic masking and without the next sentence prediction (NSP) objective. Here, masking refers to the masked language modelling objective, where a word to be predicted is replaced by the [MASK] token and the model is trained to predict it from the rest of the sequence; dynamic masking means the masked positions are re-sampled each time a sequence is seen, as illustrated below.
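
To make dynamic masking concrete, the sketch below uses the Transformers library's DataCollatorForLanguageModeling, which re-samples the [MASK] positions every time a batch is built. This is only an illustration of the idea, not the authors' training pipeline; 15% is the conventional masking probability.

 from transformers import DistilBertTokenizer, DataCollatorForLanguageModeling
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
 encoding = tokenizer("The quick brown fox jumps over the lazy dog")
 # Each call draws a fresh random set of [MASK] positions for the same sentence
 print(collator([encoding])["input_ids"])
 print(collator([encoding])["input_ids"])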

Data and compute power: The model was trained on the concatenation of English Wikipedia and the Toronto Book Corpus [Zhu et al., 2015], on 8 16GB V100 GPUs for approximately 90 hours.

Experiment Results

General language understanding: DistilBERT retains 97% of BERT's performance with 40% fewer parameters. This is measured on the General Language Understanding Evaluation (GLUE) benchmark, a collection of 9 datasets for evaluating natural language understanding systems.

Downstream task benchmark: DistilBERT also gives strong results on downstream tasks such as the IMDb sentiment classification task, where it achieves an accuracy only 0.6% below BERT's while being 40% smaller.

Size and inference speed: DistilBERT has 40% fewer parameters than BERT and is 60% faster at inference.

On-device computation: The average inference time of a DistilBERT question-answering model on an iPhone 7 Plus is 71% faster than that of a BERT-base question-answering model.

Installation

Install the HuggingFace Transformers framework via PyPI.

!pip install transformers

Demo of HuggingFace DistilBERT

You can import the DistilBERT model from transformers as shown below:

from transformers import DistilBertModel

A. Checking the configuration

 from transformers import DistilBertModel, DistilBertConfig
 # Initializing a DistilBERT configuration
 configuration = DistilBertConfig()
 # Initializing a model from the configuration
 model = DistilBertModel(configuration)
 # Accessing the model configuration
 configuration = model.config 
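
The configuration also lets you change the architecture hyper-parameters before building a model. A small sketch with illustrative values (the defaults correspond to distilbert-base):

 from transformers import DistilBertConfig, DistilBertModel
 # A smaller custom configuration; dim must stay divisible by n_heads
 small_config = DistilBertConfig(n_layers=4, n_heads=8, dim=512, hidden_dim=2048)
 small_model = DistilBertModel(small_config)
 print(small_model.config.n_layers)  # 4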

B. DistilBERT Tokenizer 

Similar to the BERT tokenizer, it provides end-to-end tokenization: punctuation splitting followed by WordPiece.

 from transformers import DistilBertTokenizer
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
 inputs 

The output of the tokenizer will be:

{'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

Here, input_ids is the numerical representation of the sequence that the DistilBERT model will use, and attention_mask indicates which tokens the model should attend to.
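
Continuing the snippet above (reusing the same tokenizer and inputs), you can map the ids back to WordPiece tokens to inspect the tokenization:

 print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
 # ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']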

You can also check DistilBertTokenizerFast.

C. DistilBERT Model

 from transformers import DistilBertTokenizer, DistilBertModel
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertModel.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
 outputs = model(**inputs)
 last_hidden_states = outputs.last_hidden_state 

last_hidden_states contains the sequence of hidden states at the output of the last layer of the model.
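
Continuing the example, you can check its shape, which is (batch_size, sequence_length, hidden_size):

 print(last_hidden_states.shape)  # torch.Size([1, 8, 768]) for the 8-token input above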

You can also check DistilBertForMaskedLM.

D. DistilBERT Masked Language Modeling

 from transformers import DistilBertTokenizer, DistilBertForMaskedLM
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
 labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
 outputs = model(**inputs, labels=labels)
 loss = outputs.loss
 logits = outputs.logits 

Here, loss is the masked language modelling loss and logits contains the prediction scores of the language modelling head (a score for each vocabulary token at each position).
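
Continuing this snippet, the model's prediction for the [MASK] position can be read off from the logits; for the pre-trained distilbert-base-uncased checkpoint the decoded token should be "paris":

 # Locate the [MASK] token and take the highest-scoring vocabulary entry for it
 mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
 predicted_id = logits[0, mask_index].argmax(dim=-1)
 print(tokenizer.decode(predicted_id))  # expected: "paris"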

E. DistilBERT for Sequence Classification

This model adds a sequence classification/regression head (a linear layer on top of the pooled output), which can be used for classification or regression problems.

 from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
 labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
 outputs = model(**inputs, labels=labels)
 loss = outputs.loss
 logits = outputs.logits 

Here, loss is the classification loss and logits are the classification scores before the softmax.
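
Continuing the snippet, the logits can be turned into class probabilities with a softmax. Note that this checkpoint's classification head is freshly initialized, so the probabilities are not meaningful until the model has been fine-tuned:

 # Convert raw classification scores into probabilities
 probabilities = torch.softmax(logits, dim=-1)
 print(probabilities)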

F. DistilBERT for Multiple Choice

 from transformers import DistilBertTokenizer, DistilBertForMultipleChoice
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
 model = DistilBertForMultipleChoice.from_pretrained('distilbert-base-cased')
 prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
 choice0 = "It is eaten with a fork and a knife."
 choice1 = "It is eaten while held in the hand."
 labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1
 encoding = tokenizer([[prompt, choice0], [prompt, choice1]], return_tensors='pt', padding=True)
 outputs = model(**{k: v.unsqueeze(0) for k,v in encoding.items()}, labels=labels) # batch size is 1
 # the linear classifier still needs to be trained
 loss = outputs.loss
 logits = outputs.logits 
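
Here logits has shape (batch_size, num_choices), so the predicted choice is simply its argmax along the last dimension (again, the classification head is untrained, so the prediction is not yet meaningful):

 # Index of the highest-scoring choice for each example in the batch
 print(logits.argmax(dim=-1))  # tensor([0]) or tensor([1]) for the untrained head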

G. DistilBERT for Token Classification 

 from transformers import DistilBertTokenizer, DistilBertForTokenClassification
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased')
 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
 labels = torch.tensor([1] * inputs["input_ids"].size(1)).unsqueeze(0)  # Batch size 1
 outputs = model(**inputs, labels=labels)
 loss = outputs.loss
 logits = outputs.logits 
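
The logits have shape (batch_size, sequence_length, num_labels), one score vector per token, so per-token label ids can be read off with an argmax (placeholder predictions until the head is fine-tuned):

 # Predicted label id for every token in the sequence
 predicted_label_ids = logits.argmax(dim=-1)
 print(predicted_label_ids.shape)  # torch.Size([1, 8])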

H. DistilBERT For Question Answering

 from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
 import torch
 tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
 model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
 question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
 inputs = tokenizer(question, text, return_tensors='pt')
 start_positions = torch.tensor([1])
 end_positions = torch.tensor([3])
 outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
 loss = outputs.loss
 start_scores = outputs.start_logits
 end_scores = outputs.end_logits 
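
Continuing the example, a common way to read off the predicted answer span is to take the argmax of the start and end scores and decode the tokens in between. The plain distilbert-base-uncased checkpoint has an untrained question-answering head, so a fine-tuned checkpoint such as distilbert-base-uncased-distilled-squad is needed for a sensible answer:

 # Pick the most likely start and end positions and decode the span between them
 answer_start = start_scores.argmax()
 answer_end = end_scores.argmax()
 answer_ids = inputs["input_ids"][0][answer_start : answer_end + 1]
 print(tokenizer.decode(answer_ids))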

References

Sanh, V., Debut, L., Chaumond, J., Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108, 2019.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., Fidler, S. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. ICCV 2015.
