How To Implement Intent Recognition With BERT

Intent Classification, or you may say Intent Recognition is the labour of getting a spoken or written text and then classifying it based on what the user wants to achieve

Intent Classification, or you may say Intent Recognition is the labour of getting a spoken or written text and then classifying it based on what the user wants to achieve. This is a form of Natural Language Processing(NLP) task, which is further a subdomain of Artificial Intelligence. This technique finds vital application in chatbots (we all have seen and conversed with numerous chatbots on different websites). Use cases involve customer support, sales, gathering insights from Anonymous Chat Data (ACD), etc.

NLP, as defined, is concerned with computation and processing, analyzing and extracting features out of natural language. One may think how it is different from Text Classification and Question Answering; the answer is given below in the following example for better understanding.

Let’s take two sentences from a chat; these are-


Sign up for your weekly dose of what's up in emerging technology.
  1. I am hungry.
  2. I have not eaten anything since morning.

These two sentences have different syntactic meanings – these are seen as facts when solving classical Text Classification problems. But let’s try and take a look from some other view (intent of the sentences). Sentence 1 sure shot says that the person is hungry. Sentence 2, however, does not contain the word hungry, yet we can understand the ‘intent’ of the person he or she is hungry at that moment considering not eating anything. 

Recognizing intents like this and so many others is the task we have to achieve. This can be done using BERT (a simple stack of encoders on top of one another) as we have to understand what the sentence is telling us.  Annotated data, that is the data should be labelled with intent; hence it is a supervised problem statement.    

The rise of chatbots is in the direction of user trust in chatbots, gaining more popularity. A survey by Helpshift found out that the number of people willing to use chatbots was twice in 2019 as that of 2018.  This is because of obvious reasons with the advancements in text models and in time support. Not only it saves unimaginable human hours but it gives humongous profits.

This article will use a pre-trained BERT model that will be huge (leveraging the model checkpoints) and fine-tune it to our needs with labelled text data with seven intents. The data is divided into three parts, namely – train test and validity.  All the data will be provided for you to download in the form of a zip file here. This contains the model architecture as well as the datasets we are going to use. The datasets are pre-processed to make this article sufficiently enough; else we’ll be at it for hours. Saving all that time, we’ll write up some functions for encoding and then define the model accordingly.

Let’s jump into the coding part already.

Code Implementation of Intent Recognition with BERT

Importing Necessary Dependencies
 # checking for the GPU we get for this model 
 # installing the latest version of tensorflow GPU
 !pip install tensorflow-gpu >> /dev/null
 !pip install --upgrade grpcio >> /dev/null
 # show progress bars while installation and downloading 
 !pip install tqdm  >> /dev/null
 # the original implementation given in research paper is not compatible with TensorFlow 2. The bert-for-tf2 package solves this issue.
 !pip install bert-for-tf2 >> /dev/null
 # install sentencepiece tokenizer for making tokens out of sentences
 !pip install sentencepiece >> /dev/null
 # making directories, math operations
 import os
 import math
 import datetime
 # show progress bars in the notebook itself
 from tqdm import tqdm
 # data manipulation in dataframe and array
 import pandas as pd
 import numpy as np
 # keras implementation
 import tensorflow as tf
 from tensorflow import keras
 # import the bert model and model layer 
 # also the bert tokenizer for text pre processing
 import bert
 # we can change the layers downloaded from here
 from bert import BertModelLayer
 # useful for loading necessary files from data
 from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights
 from bert.tokenization.bert_tokenization import FullTokenizer
 # data visualization at the time of data import and model metrics
 import seaborn as sns
 # we can set the parameters for each plot 
 from pylab import rcParams
 import matplotlib.pyplot as plt
 from matplotlib.ticker import MaxNLocator
 from matplotlib import rc
 # model metrics to be defined 
 from sklearn.metrics import confusion_matrix, classification_report
 # showing the plots in the notebook itself
 %matplotlib inline
 # retina is a style shape
 %config InlineBackend.figure_format='retina'
 # setting the visualizing settings first hand
 sns.set(style='whitegrid', palette='muted', font_scale=1.2)
 # getting the colors we need for plots
 HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
 # define the color palette
 # size can be set accordingly
 rcParams['figure.figsize'] = 12, 8
 # generating pseudo random numbers
 # np arrays , tensors
Downloading Data and Model zip
 # downloading the data from google drives which is in now form of csv files having two columns of sentences and it's intent
 # earlier the data was in form of json file but after a little bit of data pre processing , we are using it in the form of csv files 
 !gdown --id 1OlcvGWReJMuyYQuOZm149vHWwPtlboR6 --output train.csv
 !gdown --id 1Oi5cRlTybuIF2Fl5Bfsr-KkqrXrdt77w --output valid.csv
 !gdown --id 1ep9H6-HvhB4utJRLVcLzieWNUSG3P_uF --output test.csv
 # making separate dataframes for the three files 
 # which are already pre processed
 train = pd.read_csv("train.csv")
 valid = pd.read_csv("valid.csv")
 test = pd.read_csv("test.csv")
 # appending the valid df to train df to make a bigger data
 train = train.append(valid).reset_index(drop=True)
 # training df has 13784 no. of records
 # 13,784 data points with intents
 # let's check how the datset looks , if it is imbalanced or not
 # making a chart to see how many texts are for each of the seven intents
 chart = sns.countplot(train.intent, palette=HAPPY_COLORS_PALETTE)
 # setting title of the chart
 plt.title("Number of texts per intent")
 # setting labels to the bottom at an angle of 30 degrees
 chart.set_xticklabels(chart.get_xticklabels(), rotation=30, horizontalalignment='right'); 

While running the code, after running this, we see that the dataset is balanced.

 # download the pre trained bert model in the form of a zip file
 # unzipping the file having model
 # this is having all the checkpoints , vocab.text for tokenizer , configurations for model
 # make a folder named model for moving all the stuff 
 os.makedirs("model", exist_ok=True)
 # moving the stuff from bert to model folder 
 !mv uncased_L-12_H-768_A-12/ model
 # specify the model name 
 # specify where the  check point directory is 
 bert_ckpt_dir = os.path.join("model/", bert_model_name)
 # specify the checkpoint file itself
 bert_ckpt_file = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
 # specify the configuration file 
 bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json") 
 # pre processing in which we have to use the tokenizer
 # for getting tokens so that we can feed the model
 # creating a class which is a little generic for doing our tokenizing process
 class IntentDetectionData:
   # we had two columns to which have given names according to use
   DATA_COLUMN = "text"
   LABEL_COLUMN = "intent"
   creating constructor having train data , test data , tokenzier , max_seq_len because NLP tasks work on a fixed number of elements 
   the data set we are having has different lengths of sequences , that's why we have to do same length for all  , either we can cut the sequence 
   lengths to some minimal point or we can add padding up to the maximum existing 
   def __init__(self, train, test, tokenizer: FullTokenizer, classes, max_seq_len=192):
     self.tokenizer = tokenizer
     # setting initial to 0
     self.max_seq_len = 0
     self.classes = classes
     create train_X and train_Y 
     train_X will have the vectors , train_Y will have the target labels 
     same we will do for the testing part
     train, test = map(lambda df: df.reindex(df[IntentDetectionData.DATA_COLUMN].str.len().sort_values().index), [train, test])
     # we gave to set the variables to which we can map values using the above function
     ((self.train_x, self.train_y), (self.test_x, self.test_y)) = map(self._prepare, [train, test])
     print("max seq_len", self.max_seq_len)
     # setting the actual value to max_seq_len
     self.max_seq_len = min(self.max_seq_len, max_seq_len)
     # adding padding in the following line
     self.train_x, self.test_x = map(self._pad, [self.train_x, self.test_x])
   def _prepare(self, df):
     # initialising two empty arrays
     x, y = [], []
     # iterate over each row and use tqdm to visualize that
     for _, row in tqdm(df.iterrows()):
       # extracting the seq. and labels
       text, label = row[IntentDetectionData.DATA_COLUMN], row[IntentDetectionData.LABEL_COLUMN]
       # create an object named token and initialise it
       tokens = self.tokenizer.tokenize(text)
       # adding two special tokens which will surround each token 
       # from front and back 
       tokens = ["[CLS]"] + tokens + ["[SEP]"]
       # convert the token ids to numbers using a helper
       # function already provided for feeding
       token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
       # calculate the length of seq. by taking max. 
       self.max_seq_len = max(self.max_seq_len, len(token_ids))
       # appends the token ids to x
       # indexes of labels to the y
     # returning these in the form of nparrays
     return np.array(x), np.array(y)
   # takes the ids and not the tokens itself
   def _pad(self, ids):
     # create an empty list again
     x = []
     # iterating over each id
     for input_ids in ids:
       # we have to cut the longer seq. , self.max_seq_len - 2 
       # because two positions are reserved for the special tokens as mentioned before
       input_ids = input_ids[:min(len(input_ids), self.max_seq_len - 2)]
       # adding padding to the smaller ones
       input_ids = input_ids + [0] * (self.max_seq_len - len(input_ids))
       # appending  
     return np.array(x)
 # use tokenizer which expects a vocabulary file, 
 # we have this file already in our data
 tokenizer = FullTokenizer(vocab_file=os.path.join(bert_ckpt_dir, "vocab.txt"))
 # trying the tokenizer , lower cases everything , good tokenizer as it preserves punctuations
 tokenizer.tokenize("I can't wait to visit India again!")
 # example of tokenization and then converting them to token ids which are basically numbers for our model to understand 
 tokens = tokenizer.tokenize("I can't wait to visit India again!")
 # create model to use the pre trained bert model and convert it into intent recogniser by adding keras layers
 # parameters are to be the max_seq_len , checkpoint file
 def create_model(max_seq_len, bert_ckpt_file, bert_config_file):
   # open the config file using tensorflow for reading
   with, "r") as reader:
     # using this variable read the config file 
       bc = StockBertConfig.from_json_string(
       bert_params = map_stock_config_to_params(bc)
       # adapter_size is for fine tune the model faster 
       bert_params.adapter_size = None
       # this is the bert model named simply bert for ease
       # this is actually very huge stack of keras layers 
       bert = BertModelLayer.from_params(bert_params, name="bert")
   # create the input layer with seq. len as input       
   input_ids = keras.layers.Input(shape=(max_seq_len, ), dtype='int32', name="input_ids")
   # output from the bert model is being saved here
   # to which we provided our processed data (ids)
   bert_output = bert(input_ids)
   # printing the shape of bert model
   print("bert shape", bert_output.shape)
   # use lambda layer from keras , taking the first layer out of those 38 layers 
   # and rest we have taken all 768
   cls_out = keras.layers.Lambda(lambda seq: seq[:, 0, :])(bert_output)
   # drop out layers to prevent overfitting using the keras API with 50% probability 
   cls_out = keras.layers.Dropout(0.5)(cls_out)
   # flatten this using the dense layer , activation function used hai tanh
   logits = keras.layers.Dense(units=768, activation="tanh")(cls_out)
   # adding one more drop out layer 
   logits = keras.layers.Dropout(0.5)(logits)
   # dense o/p layer , activation to be used is softmax as this is classification 
   logits = keras.layers.Dense(units=len(classes), activation="softmax")(logits)
   # this is going to be a keras model , i/p - input_ids , o/p - logits from the last layer 
   model = keras.Model(inputs=input_ids, outputs=logits)
   building the model and specify the i/p shape of this whole model 
   the first parameter is None because this will be initialised by keras for us 
   ''', max_seq_len))
   # load the weights
   load_stock_weights(bert, bert_ckpt_file)
   return model
 # taking unique intents and convert them to list 
 classes = train.intent.unique().tolist()
 # already defined function which we created earlier
 data = IntentDetectionData(train, test, tokenizer, classes, max_seq_len=128)
 # we can see the test and train data by running this cell
 # these many token ids with these many seq. len
 # this is how a data point looks 
 # this is a single id
 # max len of token ids
 # create the model with three input parameters 
 model = create_model(data.max_seq_len, bert_ckpt_file, bert_config_file)
 # we get some unused weights
 these are the layers that we put earlier in our model 
 bert layer has around 108 mil param
 last dense layer will have the o/p only which is 7 classes
 # compiling the model 
   # optimizer is Adam with learning rate 0.00001 ,  this is recommended by authors of bert model paper
   # using SparseCategoricalCrossentropy
   # because we have not used OneHotencoding here 
   # specify the metrics from keras 
 # use a tensorboard callback so that we can see the training processafter everything is completed 
 log_dir = "log/intent_detection/" +"%Y%m%d-%H%M%s")
 # create a callback from keras 
 tensorboard_callback = keras.callbacks.TensorBoard(log_dir=log_dir)
 # finally fit the model
 history =
   # validation split to be 10% of the data 
   # recommendation from authors , batch to be 16
   # yes we want to shuffle 
   # another recommendation from authors
   # specify the call back 
 # showing the tensorboard
 %tensorboard --logdir log 
 # how loss has decreased with epoch number 
 ax = plt.figure().gca()
 plt.legend(['train', 'test'])
 plt.title('Loss over training epochs');
 # graph for accuracy against the epoch number 
 ax = plt.figure().gca()
 plt.legend(['train', 'test'])
 plt.title('Accuracy over training epochs');
 # evaluate the test accuracy , call evaluate method and give the data 
 _, train_acc = model.evaluate(data.train_x, data.train_y)
 _, test_acc = model.evaluate(data.test_x, data.test_y)
 print("train acc", train_acc)
 print("test acc", test_acc)
 # take the class that is most likely to be the correct one based on model's opinion
 y_pred = model.predict(data.test_x).argmax(axis=-1) 
Prediction with examples
 # print classification report 
 print(classification_report(data.test_y, y_pred, target_names=classes))
 # create a confusion matrix and chart it
 cm = confusion_matrix(data.test_y, y_pred)
 df_cm = pd.DataFrame(cm, index=classes, columns=classes)
 # using a heatmap from seaborn to visualize the confusion matric
 hmap = sns.heatmap(df_cm, annot=True, fmt="d")
 hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
 hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
 plt.ylabel('True label')
 plt.xlabel('Predicted label');
 # trying a few sentences which the model has not seen 
 sentences = [
   "Rate this book as awful"
 pred_tokens = map(tokenizer.tokenize, sentences)
 pred_tokens = map(lambda tok: ["[CLS]"] + tok + ["[SEP]"], pred_tokens)
 pred_token_ids = list(map(tokenizer.convert_tokens_to_ids, pred_tokens))
 pred_token_ids = map(lambda tids: tids +[0]*(data.max_seq_len-len(tids)),pred_token_ids)
 pred_token_ids = np.array(list(pred_token_ids))
 predictions = model.predict(pred_token_ids).argmax(axis=-1)
 for text, label in zip(sentences, predictions):
   print("text:", text, "\nintent:", classes[label])


As we can see in the output, the model has successfully recognized the intent on never seen examples. I recommend the readers to try this model on different datasets, altering the model hyperparameters. This is a beautiful, more pythonic way of implementing a model so make sure to understand why generic functions in such a big model are vital.


More Great AIM Stories

Mudit Rustagi
Mudit is experienced in machine learning and deep learning. He is an undergraduate in Mechatronics and worked as a team lead (ML team) for several Projects. He has a strong interest in doing SOTA ML projects and writing blogs on data science and machine learning.

Our Upcoming Events

Conference, in-person (Bangalore)
Machine Learning Developers Summit (MLDS) 2023
19-20th Jan, 2023

Conference, in-person (Bangalore)
Rising 2023 | Women in Tech Conference
16-17th Mar, 2023

Conference, in-person (Bangalore)
Data Engineering Summit (DES) 2023
27-28th Apr, 2023

Conference, in-person (Bangalore)
MachineCon 2023
23rd Jun, 2023

Conference, in-person (Bangalore)
Cypher 2023
20-22nd Sep, 2023

3 Ways to Join our Community

Whatsapp group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our newsletter

Get the latest updates from AIM