
How Does A Simple Chatbot With NLTK Work?

Chatbots are programs built to interact with human beings, for purposes such as information retrieval or customer care. Today's chatbots can closely mimic a human chat session. Whether or not you find them amusing, they are already widespread on the internet.

In this article, you will learn, in the simplest way possible, the complete anatomy of one of the easiest chatbots you can build using NLTK and Python. The code discussed below draws its inspiration from Building-a-Simple-Chatbot-in-Python-using-NLTK by Parul Pandey. I have made some changes to the code for simplicity and easier explanation.



In the following section, we will walk through the complete code and explain it line by line.


Overview Of Chatty

This chatbot can parse a document of textual information and answer the user's queries from it. The chatbot uses the Natural Language Toolkit (NLTK) to process the textual information.

Let us begin!

First, we import the nltk and string libraries and download the data that NLTK needs to process text.

import nltk
import string

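#Download only once
nltk.download('punkt')     #pre-trained tokenizer for English
nltk.download('wordnet')   #lexical database for the English language
nltk.download('stopwords') #stopword lists used later when cleaning the tokens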

Now we need to feed some information into the chatbot so that it can answer our queries. Copy a piece of text from the internet and store it in a variable. Make sure it has at least 3 sentences so that the chatbot has something to choose from. You can also load data from a file or webpage as shown in the original code.

text = """Analytics India Magazine (AIM) is India’s no.1 platform on analytics, data science and big data, dedicated to …."""

We will now convert all the letters in the text to lowercase so that the same word is not counted as several different words because of differences in case.

text = text.lower()

Now we need to tokenize the text into sentences and words. The lines of code below create two lists: the first consists of all the sentences in the text and the second of all the words in the text. We use the sentence tokenizer and word tokenizer functions from nltk as shown below.

sentences = nltk.sent_tokenize(text)
tokens = nltk.word_tokenize(text)
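To get a feel for what the two lists look like without downloading any NLTK data, here is a rough standard-library approximation. The regular expressions and the sample string are assumptions made purely for illustration; NLTK's tokenizers handle abbreviations, contractions and other edge cases that this sketch does not.

```python
import re

sample = "chatty is a simple bot. it answers queries! does it work?"

# Naive sentence splitting: break after '.', '!' or '?' when whitespace follows.
approx_sentences = re.split(r"(?<=[.!?])\s+", sample)

# Naive word splitting: runs of word characters, with each punctuation
# mark kept as a separate token.
approx_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(approx_sentences)
print(approx_tokens[:6])
```

The first list holds whole sentences, which Chatty later returns as answers; the second holds individual tokens, which are what get cleaned and counted.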

After we tokenize, we will clean up the tokens by lemmatizing them, removing the stopwords and removing the punctuation. Lemmatizing is the process of converting a word into its root form. For example, words like run, ran and running all convey the same meaning and hence don't need to be treated as different words; lemmatizing reduces all of them to run. Stopwords are the most frequent words in natural language, such as 'a', 'is' and 'what', which do not help in telling one sentence from another, so we remove them as well.

#Initializing the WordNetLemmatizer
lemmer = nltk.stem.WordNetLemmatizer()

#Importing the stopwords
from nltk.corpus import stopwords

#Building the stopword set once instead of once per token
stop_words = set(stopwords.words('english'))

#Lemmatizing the words or tokens
def LemTokens(tokens):
   return [lemmer.lemmatize(token,'v') for token in tokens if token not in stop_words]

The above code block reduces each word to its root form while also removing the stopwords. The second argument 'v' tells the lemmatizer to treat every token as a verb. For example, lemmer.lemmatize('running','v') returns 'run'.

#A dictionary reference for replacing special characters
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

The above code block builds a dictionary with the integer code points of all the punctuation characters in 'string.punctuation' as keys and 'None' as values, as shown below:


out: {33: None, 34: None, 35: None, 36: None, 37: None, 38: None, 39: None, 40: None, 41: None, 42: None, 43: None, 44: None, 45: None, 46: None, 47: None, 58: None, 59: None, 60: None, 61: None, 62: None, 63: None, 64: None, 91: None, 92: None, 93: None, 94: None, 95: None, 96: None, 123: None, 124: None, 125: None, 126: None}

In the following code block, we define a function called LemNormalize that uses the LemTokens function and the remove_punct_dict dictionary defined earlier to clean, tokenize and lemmatize the text.

#Method to clean up and tokenize the text
def LemNormalize(text):
   return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

The translate method replaces every character whose code point is a key in remove_punct_dict with the mapped value; since every value is None, the punctuation is simply deleted. For example, "hi!!..".translate(remove_punct_dict) returns 'hi'.
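The punctuation-stripping step can be tried on its own with nothing but the standard library. The sample string below is made up for illustration; the sketch also shows that str.maketrans with a third argument builds an equivalent table, which is a common alternative to the dictionary comprehension.

```python
import string

# Same mapping as in the article: code point of each punctuation mark -> None.
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

# str.maketrans("", "", chars) builds a table that deletes every char in chars.
same_table = str.maketrans("", "", string.punctuation)

cleaned = "hi!!.. how are you?".translate(remove_punct_dict)
print(cleaned)                          # hi how are you
print(same_table == remove_punct_dict)  # True
```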

We now have methods to clean up the data. So we can proceed to the actual processing.

In the following code, we will convert the sentences into the bag of words model using the CountVectorizer class from scikit-learn. Then we will compute the cosine similarity between the user's input and each of the sentences in the bag of words.

#Importing the libraries for cosine_similarity & CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
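Before relying on CountVectorizer, it may help to see the bag of words idea spelled out by hand. The sketch below uses only the standard library; the toy sentences, the whitespace tokenization and the helper name bag_of_words are assumptions for illustration and do not reflect how scikit-learn implements it internally.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    tokenized = [doc.split() for doc in docs]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["chatbots answer queries", "chatbots parse text", "parse text text"]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)  # ['answer', 'chatbots', 'parse', 'queries', 'text']
for row in matrix:
    print(row)     # [1, 1, 0, 1, 0] then [0, 1, 1, 0, 1] then [0, 0, 1, 0, 2]
```

Each row is one sentence and each column one word, which is the shape of matrix that the similarity computation works on.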

We can now define a function that takes the user input and checks its similarity with the sentences in the text.

def response(user_response):                                            # (1)
    sentences.append(user_response)                                     # (2)
    cv = CountVectorizer(max_features = 50, tokenizer = LemNormalize, analyzer = 'word')  # (3)
    X = cv.fit_transform(sentences)                                     # (4)
    vals_cv = cosine_similarity(X[-1], X)                               # (5)
    indx_of_most_similar_sentence = vals_cv.argsort()[0][-2]            # (6) indexes sorted by increasing similarity
    flat_vals_cv = vals_cv.flatten()                                    # (7)
    flat_vals_cv.sort()                                                 # (8)
    highest_similarity = flat_vals_cv[-2]                               # (9) best match other than the input itself

    if highest_similarity == 0:                                         # (10)
        robo_response = "I am sorry! I don't understand you"
    else:                                                               # (11)
        robo_response = sentences[indx_of_most_similar_sentence]

    sentences.pop()  # remove the user input again so it cannot be returned as an answer later
    return robo_response

Compare the code above with the following step-by-step description; each step number matches the number marked on the corresponding line of code.

1)Function definition:- Define the response function.

2)Adding the user input to the sentences corpus.

3)Initializing the CountVectorizer with parameters max_features = 50, tokenizer = LemNormalize and analyzer = 'word'. The count vectorizer will create a matrix with words as columns, sentences as rows and the count of each word in each sentence as values.

Setting max_features = 50 keeps at most the 50 most frequent words in the sentences corpus as columns or features, with each sentence as a row. For example, if the text has 4 sentences and at least 50 distinct words after cleaning, the CountVectorizer will create a matrix of shape 4×50.

4)Fitting the vectorizer on the sentences corpus and transforming the corpus into the count matrix X.

5)Calculating the cosine similarity of the last row of X (the user input) with every row of the count matrix X.

The cosine similarity of two count vectors is their dot product divided by the product of their magnitudes. This will yield an array of length 4 for a corpus of 4 sentences (the 4th sentence being the user input) with the cosine similarities as its elements. The user input compared with itself gives the highest possible similarity, which is why the following steps look at the second highest value instead.

6)Sorting the indexes of the array with cosine similarities in increasing order and taking the second last element. This will give the index of the most similar sentence.

7)Flattening the two-dimensional cosine similarity array into a one-dimensional array.

8)Sorting the values in increasing order of cosine similarities.

9)Storing the second highest cosine similarity value.

10)If the second highest cosine similarity value is zero, there is no match and Chatty replies that it cannot understand the user's query.

11)Otherwise Chatty displays the matched sentence from the text to the user.
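Steps 5 to 10 can be traced by hand on a tiny example. The sketch below is a standard-library illustration, not the scikit-learn implementation; the count vectors and the helper name cosine are made up for the purpose, and the last row stands in for the user input.

```python
import math

def cosine(u, v):
    """Dot product divided by the product of magnitudes; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Rows are count vectors for four sentences; the last one is the user input.
X = [
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 0, 2],
    [0, 0, 1, 0, 1],
]

# Step 5: similarity of the user input with every row, itself included.
sims = [cosine(X[-1], row) for row in X]

# Step 6: indexes ordered by increasing similarity; the last index is the
# user input itself, so the second last is the best matching sentence.
order = sorted(range(len(sims)), key=lambda i: sims[i])
best_index = order[-2]

# Steps 7 to 9: the second highest similarity value.
best_value = sorted(sims)[-2]

# Step 10: a value of zero would mean that no sentence shares a word with the input.
print(best_index, round(best_value, 3))
```

Here the third sentence (index 2) wins with a similarity of about 0.949, while the first sentence scores 0 because it shares no words with the input.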

The following code block runs a loop that keeps the chat session going until the user quits with any of the exit codes or answers no when asked whether to continue. Each time the user enters a sentence or word, it is passed to the 'response' function explained above, which returns the best matching sentence if it finds one and a default message otherwise.

exit_codes = ['bye', 'see you', 'c ya', 'exit']
print("Hi! I'm Chatty, I will try to answer your queries!")

while True:
    user_response = input("User:")
    if user_response.lower() not in exit_codes:
        user_response = user_response.lower()
        print("chatty :", response(user_response))
        print('\nDo you want to continue ? (yes/no)')
        user_response = input("User-:yes/no? ")

        if user_response.lower() == 'no' or user_response.lower() in exit_codes:
            break       # the user chose to end the session
        else:
            continue    # ask for the next query
    else:
        break           # an exit code was typed instead of a query
Sample chat session with Chatty: [screenshot not available]

That’s it! Have a fun time with Chatty and Happy coding!


Amal Nair
A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact:
