Using Natural Language Processing To Check Word Frequency In ‘The Adventures of Sherlock Holmes’

Natural Language Processing is one of the most commonly used techniques in machine learning applications, given the wide range of analysis, extraction, processing and visualisation tasks that it can perform. In this article, you will learn how to implement each of these aspects and present your project. The primary goal of this project is to tokenize the textual content, remove the stop words and find the high-frequency words. We shall implement this in Python 3.6.4.

To start with, we shall look into the libraries that we are going to use:

  1. BeautifulSoup: To scrape the data from the HTML of a website and to extract only the text from that HTML
  2. Regular Expressions: Also known as regex. They filter out noisy data such as special characters, so that only words remain; the text is then converted from uppercase to lowercase
  3. NLTK (Natural Language Toolkit): For the tokenization of the sentences into a list of words and for its list of English stop words

We are using the eBook of The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle, which is available here.




Let Us Grab The URL Of The Book And Start Our Project

Data Extraction:

Assign the URL of the book to a variable.
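The link to the eBook is not reproduced in this copy of the article, so the address in the sketch below is our assumption: it points at what we believe is the Project Gutenberg HTML edition of the book. Replace it with the link you are actually using.

```python
# Assumed location of the eBook (Project Gutenberg HTML edition).
# This address is an assumption, not necessarily the link the article used.
url = "https://www.gutenberg.org/files/1661/1661-h/1661-h.htm"

print(url)
```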

Now that we have the URL, let us make a request. When you visit a web page, your browser sends a GET request to the server. The requests library makes this easy with its get function. Make the request and check the type of the object returned. There are other types of requests, such as POST requests, but they do not concern us in this project.
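The request step could be sketched as follows. The helper name fetch_page, the timeout value and the URL are our own assumptions for illustration; only requests.get itself comes from the library.

```python
import requests

# Assumed eBook address (Project Gutenberg HTML edition); replace as needed.
url = "https://www.gutenberg.org/files/1661/1661-h/1661-h.htm"


def fetch_page(page_url: str, timeout: float = 30.0) -> requests.Response:
    """Send a GET request and return the response, raising on HTTP errors."""
    response = requests.get(page_url, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response
```

Calling r = fetch_page(url) and then type(r) should report a requests.models.Response object, and the HTML of the page is then available as r.text.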

After getting the HTML from the link, let us process it to get the text from the body.

Text Extraction From HTML:

We shall make use of BeautifulSoup to extract the text from the HTML content. Let’s import BeautifulSoup from bs4 and parse the HTML content with the parser argument “html5lib”. You can also use other parsers such as “lxml” or “html.parser”.
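A minimal sketch of the parsing step is shown below. To keep it runnable offline, it parses a tiny hand-written HTML string instead of the downloaded page (in the project this would be the text of the response), and it uses the "html.parser" parser because that one ships with Python.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the downloaded page; an assumption for illustration only.
html = """
<html>
  <head><title>The Adventures of Sherlock Holmes</title></head>
  <body>
    <h2>A Scandal in Bohemia</h2>
    <p>To Sherlock Holmes she is always the woman.</p>
  </body>
</html>
"""

# "html.parser" needs no extra install; "lxml" and "html5lib" are separate packages.
soup = BeautifulSoup(html, "html.parser")
print(type(soup))
```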

Let us look at the title of the eBook to learn more about how BeautifulSoup works.

To extract just the string from the contents inside the title tag, use the .string attribute of the tag.
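The difference between the title tag and its string can be seen in this small sketch, which again works on a hand-written HTML snippet rather than the real page:

```python
from bs4 import BeautifulSoup

# Hand-written sample; the real page would come from the request step.
html = "<html><head><title>The Adventures of Sherlock Holmes</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> element, tags included
title_text = soup.title.string  # only the text inside the tag

print(title_tag)
print(title_text)
```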

Let us take a look at all the chapters available in the book and how they are represented in the HTML code.
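One way to inspect the chapters is with find_all. The sketch below assumes, purely for illustration, that each story title sits in an h2 tag; the real page may mark up its chapters differently, so check the page source before relying on this.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: we ASSUME each story title is an <h2> element.
html = """
<body>
  <h2>A Scandal in Bohemia</h2>
  <h2>The Red-Headed League</h2>
  <h2>A Case of Identity</h2>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

headings = soup.find_all("h2")
for tag in headings:
    print(tag)  # the raw tag shows how the chapter is represented in HTML

# Strip the markup to keep only the chapter titles.
chapter_titles = [tag.get_text(strip=True) for tag in headings]
print(chapter_titles)
```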

This is the output that we are looking for. The complete textual content of the Sherlock Holmes eBook can be accessed with the .get_text() method.
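As a small illustration of .get_text(), the sketch below strips the tags from a hand-written sample page and keeps only the text:

```python
from bs4 import BeautifulSoup

# Hand-written sample standing in for the real eBook page.
html = """
<html>
  <head><title>Sample</title></head>
  <body>
    <h2>A Scandal in Bohemia</h2>
    <p>To Sherlock Holmes she is always the woman.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node and drops all the markup.
text = soup.get_text()
print(text)
```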

Now that you have the text of interest, it’s time to count how many times each word appears and to plot the frequency histogram. This is where Natural Language Processing comes into the picture.

Extract Words From Your Text With NLP:

We’ll now use nltk, the Natural Language Toolkit, to

  1. Tokenise the text (split it into a list of words);
  2. Remove stopwords (words such as ‘a’ and ‘the’ that occur very frequently).

We will use a regular expression first, to remove all the unwanted characters from the text. The pattern is ‘\w+’:

  • The ‘\w’ is a special character class that matches any alphanumeric character (A-Z, a-z, 0-9) as well as the underscore;
  • The ‘+’ means that the preceding character can appear one or more times in the strings that you’re trying to match. This means that ‘\w+’ will match arbitrary sequences of alphanumeric characters and underscores.
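The effect of the ‘\w+’ pattern can be sketched with Python's built-in re module on a short sample sentence. NLTK's RegexpTokenizer accepts the same pattern; the standard library is used here only so that the example runs without extra installs.

```python
import re

# Short sample sentence standing in for the full text of the book.
text = "To Sherlock Holmes she is always the woman."

# \w+ matches runs of letters, digits and underscores,
# so spaces and punctuation never end up in a token.
tokens = re.findall(r"\w+", text)
print(tokens)
```

Note that a word containing an apostrophe, such as "don't", would be split into two tokens by this pattern.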

Let us now convert all the uppercase letters to lowercase. This step is necessary because Python string comparison is case-sensitive, so ‘The’ and ‘the’ would otherwise be counted as two different words.
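A minimal sketch of the lowercasing step, using a short hand-written token list in place of the real tokens:

```python
# Hand-written sample tokens; in the project these come from the tokenizer.
tokens = ["To", "Sherlock", "Holmes", "she", "is", "always", "THE", "woman", "the"]

# Lowercase every token so that "THE" and "the" count as the same word.
words = [token.lower() for token in tokens]
print(words)
```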

Removal Of Stop Words:

It is common practice to remove words that appear frequently in the English language such as ‘the’, ‘of’ and ‘a’ (known as stopwords) because they’re not so interesting.

The package nltk has a list of stopwords in English which you’ll now store as sw and of which you’ll print the first several elements.

If you get an error here, run the command nltk.download('stopwords') to download the stopwords to your system.

Now we need to remove all the words that are in sw from the tokenised text to complete the NLTK extraction and processing.
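The loading and filtering steps could look roughly like the sketch below. The helper names load_english_stopwords and remove_stopwords are our own, and the demo at the bottom uses a small hand-written stopword list so that it runs without downloading anything.

```python
def load_english_stopwords():
    """Return NLTK's English stopword list, downloading it on first use."""
    # Imported here so the filtering demo below also runs without NLTK installed.
    import nltk
    from nltk.corpus import stopwords

    try:
        return stopwords.words("english")
    except LookupError:  # raised when the corpus has not been downloaded yet
        nltk.download("stopwords")
        return stopwords.words("english")


def remove_stopwords(words, sw):
    """Keep only the words that are not in the stopword collection."""
    sw_set = set(sw)  # set membership is much faster than scanning a list
    return [word for word in words if word not in sw_set]


# Small hand-written stand-in for sw, used only to keep this demo offline.
demo_sw = ["to", "she", "is", "the", "a", "of"]
words = ["to", "sherlock", "holmes", "she", "is", "always", "the", "woman"]
words_ns = remove_stopwords(words, demo_sw)
print(words_ns)
```

In the project itself you would call sw = load_english_stopwords(), print sw[:5] to inspect the first few entries, and pass sw to remove_stopwords instead of the demo list.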

Presenting The Project:

With the help of seaborn and matplotlib, let us visualise how the word frequencies are distributed and present our results for the book The Adventures of Sherlock Holmes by Arthur Conan Doyle.

Plotting the most frequent words, together with the tokenised word count, completes the project and presents our findings.
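The article's own plotting code is not reproduced here, so the following is only a sketch under our own assumptions: it counts words with collections.Counter and draws a bar chart with plain matplotlib, leaving out seaborn styling. The helper names top_words and plot_top_words and the sample word list are hypothetical.

```python
from collections import Counter


def top_words(words, n=25):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)


def plot_top_words(pairs):
    """Draw a bar chart of (word, count) pairs with matplotlib."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    labels = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(labels, counts)
    ax.set_xlabel("Word")
    ax.set_ylabel("Count")
    ax.set_title("Most frequent words")
    ax.tick_params(axis="x", labelrotation=90)
    fig.tight_layout()
    plt.show()


# Hand-written sample; in the project this is the stopword-free word list.
words_ns = ["holmes", "said", "upon", "holmes", "man", "said", "holmes"]
pairs = top_words(words_ns, n=3)
print(len(words_ns))  # tokenised word count
print(pairs)
```

Calling plot_top_words(pairs) then opens the chart; on the full book you would pass the complete list of filtered words to top_words first.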

Kishan Maladkar
Kishan Maladkar holds a degree in Electronics and Communication Engineering, exploring the field of Machine Learning and Artificial Intelligence. A Data Science Enthusiast who loves to read about the computational engineering and contribute towards the technology shaping our world. He is a Data Scientist by day and Gamer by night.
