Last updated May 23, 2021

Hands-On Guide To Detecting SMS Spam Using Natural Language Processing


Published on October 20, 2020

by Ankit Das

In recent times, the internet and social media have become the fastest and easiest ways to get information, and messages, reviews and opinions are now a significant source of it. Short Message Service (SMS) is considered one of the most powerful means of communication. As dependence on mobile devices has grown drastically over time, so has the number of attacks in the form of SMS spam. Thanks to advances in technology, we can now extract meaningful information from such data using various artificial intelligence techniques. Natural Language Processing (NLP), a part of data science, is one way to deal with this problem and gain valuable insights.

The main aim of this article is to understand how to build an SMS spam detection model. We will build a binary classification model to detect whether a text message is spam or not.

About the Dataset

The data can be downloaded from here. It contains 5573 rows and 2 columns. Each row holds the text of a message and a label marking it as spam or ham (not spam).

Code Implementation

The code was written in Google Colab and downloaded as an .ipynb file.

Install all the packages

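# Install packages (the leading "!" runs pip as a shell command in a Colab or Jupyter cell)
!pip install wordcloud
# Render plots inline in the notebook
%matplotlib inline
# Plotting and word clouds
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# General utilities and data handling
import csv
import pickle
import pandas as pd
import numpy as np
# Text processing
import nltk
from nltk.corpus import stopwords
# Feature extraction, models and model selection
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold, cross_val_score, learning_curve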

Read the data

Import the dataset spam.csv. The unwanted columns then need to be removed.

data = pd.read_csv('data/spam.csv', encoding='latin-1')
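The read step above leaves the frame exactly as loaded. Below is a minimal sketch of the cleanup, assuming the CSV follows a layout in which the label sits in a column named v1, the message in v2, and a few mostly empty "Unnamed" columns follow. These column names are an assumption and are not stated in this article. A tiny in-memory frame stands in for spam.csv so that the snippet runs on its own.

```python
import pandas as pd

# Stand-in for pd.read_csv('data/spam.csv', encoding='latin-1').
# The column names (v1, v2, 'Unnamed: ...') are an assumption about the CSV layout.
raw = pd.DataFrame({
    'v1': ['ham', 'spam'],
    'v2': ['See you at lunch', 'WIN a free prize now'],
    'Unnamed: 2': [None, None],
    'Unnamed: 3': [None, None],
    'Unnamed: 4': [None, None],
})

def clean_columns(df):
    # Drop the unwanted columns and give the remaining two readable names
    unwanted = [col for col in df.columns if col.startswith('Unnamed')]
    df = df.drop(columns=unwanted)
    df = df.rename(columns={'v1': 'label', 'v2': 'text'})
    # Message length in characters, handy for plotting length distributions
    df['length'] = df['text'].str.len()
    return df

cleaned = clean_columns(raw)
print(cleaned.columns.tolist())
```

With the real file, calling clean_columns on the loaded frame would give the label, text and length columns that the later steps rely on.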

The data has ham and spam messages labelled.

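The distribution of ham and spam messages looks like this:

Overall distribution of spam and ham messages

# Distribution of message lengths for spam and ham messages
# 'length' is the character count of each message and must exist before plotting
data['length'] = data['text'].apply(len)
data.hist(column='length', by='label', bins=100, figsize=(20,7))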

Creating a corpus of spam and ham messages

import nltk
nltk.download('punkt')  # tokenizer data required by nltk.word_tokenize

ham_words = ''
spam_words = ''

# Creating a corpus of spam messages
for val in data[data['label'] == 'spam'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for word in tokens:
        spam_words = spam_words + word + ' '

# Creating a corpus of ham messages
for val in data[data['label'] == 'ham'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for word in tokens:
        ham_words = ham_words + word + ' '

Creating Spam and Ham word clouds

Next, we create word clouds for the spam and ham messages. A word cloud is a data visualization technique for text data in which the size of each word indicates its frequency or importance.

spam_wordcloud = WordCloud(width=500, height=300).generate(spam_words)
ham_wordcloud = WordCloud(width=500, height=300).generate(ham_words)

#Spam Word cloud
plt.figure( figsize=(10,8), facecolor='w')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Spam Wordcloud

#Creating Ham wordcloud
plt.figure( figsize=(10,8), facecolor='g')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Ham Wordcloud

Data pre-processing of SMS Spam

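Removing punctuation and stop words from the text data.

import string
nltk.download('stopwords')  # stop word lists used below

# Build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(text):
    # Strip punctuation, then drop English stop words
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)

data['text'] = data['text'].apply(text_process)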
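To see what this cleaning step does to a single message, here is a small self-contained sketch. It uses a short hand-picked stop word set in place of the NLTK list so that it runs without any downloads. The set, the function name and the sample message are all made up for illustration.

```python
import string

# Small hand-picked stop word set, standing in for stopwords.words('english')
STOP_WORDS = {'a', 'the', 'to', 'you', 'your', 'have', 'is'}

def strip_punct_and_stopwords(text, stop_words=STOP_WORDS):
    # Remove every punctuation character, then filter out stop words
    text = text.translate(str.maketrans('', '', string.punctuation))
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)

sample = "Congrats! You have WON a prize, call now."
print(strip_punct_and_stopwords(sample))  # Congrats WON prize call now
```

Note that the stop word check is case-insensitive while the kept words retain their original casing, which matches the behaviour of the function above.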

Converting text to vectors

Next, we convert the text to vectors so that the model can classify it. Two common techniques are Bag of Words and the TF-IDF vectorizer. Ideally, the representation should avoid an overly sparse matrix while retaining most of the linguistic information. The problem with Bag of Words is that it records only raw counts and so treats every word as equally important. TF-IDF resolves this by assigning different weights to the words: a word that occurs in many messages is weighted down, while a word concentrated in a few messages is weighted up.
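To make the difference concrete, here is a small sketch on a made-up three-message corpus (not the SMS dataset). It compares the raw count and the TF-IDF weight of a word that occurs in every message ("call") with one that occurs in only one ("free"), using scikit-learn's CountVectorizer and TfidfVectorizer with default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus: "call" occurs in every message, "free" in only one
docs = ["free prize call now", "call me now", "call you later"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Bag of Words: both words get the same raw count in the first message
count_call = counts[0, count_vec.vocabulary_["call"]]
count_free = counts[0, count_vec.vocabulary_["free"]]
print(count_call, count_free)

# TF-IDF: the rarer word "free" receives the larger weight
tfidf_call = tfidf[0, tfidf_vec.vocabulary_["call"]]
tfidf_free = tfidf[0, tfidf_vec.vocabulary_["free"]]
print(tfidf_call < tfidf_free)
```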

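# Text to Vector (Bag of Words)
# The vocabulary is not defined in the snippet as published; building it from
# every distinct word in the cleaned messages is an assumption
vocab = sorted({word for message in data['text'] for word in message.split()})
vocab_size = len(vocab)
word2idx = {word: idx for idx, word in enumerate(vocab)}

def text_to_vector(text):
    # Count how often each vocabulary word occurs in one message
    word_vector = np.zeros(vocab_size)
    for word in text.split():
        idx = word2idx.get(word)
        if idx is not None:
            word_vector[idx] += 1
    return word_vector

# Convert all messages to vectors
word_vectors = np.zeros((len(data), vocab_size), dtype=np.int_)
for i, message in enumerate(data['text']):
    word_vectors[i] = text_to_vector(message)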

#Converting words to vector using TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape
 
#features = word_vectors
features = vectors

Split the data using sklearn library

Next, we split the data into training and test sets before applying machine learning models to it. 85% of the data is used for training and 15% for testing.

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(features, data['label'], test_size=0.15, random_state=111)
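Spam datasets are usually imbalanced, so one optional variation, which is not what the split above does, is to pass the labels to the stratify parameter of train_test_split so that both parts keep the same ham-to-spam ratio. The sketch below uses made-up toy data with 80 ham and 20 spam entries; the variable names are chosen for illustration only.

```python
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy data: 80 ham and 20 spam messages
messages = [f"message {i}" for i in range(100)]
labels = ["ham"] * 80 + ["spam"] * 20

msg_train, msg_test, lab_train, lab_test = train_test_split(
    messages, labels,
    test_size=0.15,      # same 85/15 proportion as above
    random_state=111,
    stratify=labels,     # keep the ham/spam ratio equal in both parts
)

print(len(msg_train), len(msg_test))
print(lab_test.count("spam") / len(lab_test))
```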

Training using multiple machine learning models

#Training multiple machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)
clfs = {'SVC' : svc,'KN' : knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc}
def train(clf, features, targets):    
    clf.fit(features, targets)
def predict(clf, features):
    return (clf.predict(features))
pred_scores_word_vectors = []
for k,v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test , pred)]))

Score for all the machine learning models

pred_scores_word_vectors
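The score list is a list of (name, [accuracy]) pairs, which can be hard to scan. One possible way to view it as a sorted table is sketched below with pandas. The entries in example_scores are placeholders in the same shape, with made-up names and numbers; they are not results from this article, and scores_to_frame is a hypothetical helper.

```python
import pandas as pd

# Placeholder entries in the same (name, [accuracy]) shape that the training
# loop appends; names and numbers are made up, not results from this article
example_scores = [("model_a", [0.80]), ("model_b", [0.90]), ("model_c", [0.70])]

def scores_to_frame(pred_scores):
    # One row per model, a single 'accuracy' column, best model first
    frame = pd.DataFrame.from_dict(
        dict(pred_scores), orient="index", columns=["accuracy"]
    )
    return frame.sort_values("accuracy", ascending=False)

table = scores_to_frame(example_scores)
print(table)
```

Passing the real score list to the same helper would rank the six classifiers by test accuracy.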

Model Prediction

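def find(x):
    # The labels used in this walkthrough are the strings 'spam' and 'ham';
    # compare against 1 instead if the labels were encoded numerically
    if x == 'spam':
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

text = ["Free tones Hope you enjoyed your new content"]
integers = vectorizer.transform(text)
x = mnb.predict(integers)[0]
find(x)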
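Prediction needs both the fitted vectorizer and the fitted classifier. One way to keep the two together is a scikit-learn pipeline, sketched below under stated assumptions: the six training messages are made up so that the snippet runs on its own, and the result is not the model trained in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages, only to make the sketch runnable
train_texts = [
    "free prize call now",
    "win free cash now",
    "claim your free prize",
    "see you at lunch",
    "are we meeting tomorrow",
    "call me when you are home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline applies TF-IDF and then Naive Bayes in a single fit/predict call
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.2))
spam_pipeline.fit(train_texts, train_labels)

new_messages = ["free cash prize", "see you tomorrow"]
predictions = spam_pipeline.predict(new_messages)
for message, label in zip(new_messages, predictions):
    print(f"{message!r} -> {label}")
```

Because the pipeline takes raw strings, there is no separate transform step to forget when classifying new messages.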

Final Thoughts

We used various machine learning algorithms to classify text messages and compared their accuracy. The Naive Bayes classifier gives the best result of all, with an accuracy of over 98%. This article provides an overview of different techniques for classifying a text message as “spam” or “not spam”. Going further, we can explore deep learning models such as LSTM and Bi-LSTM to try to get better results.

The complete code can be found in AIM’s GitHub repository. Please visit this link to find the code.


