Hands-On Guide To Detecting SMS Spam Using Natural Language Processing

Short Message Service (SMS) is one of the most widely used means of communication. As dependence on mobile devices has grown, so has the number of attacks in the form of SMS spam. This article shows how to build an SMS spam detection model: a binary classifier that decides whether a text message is spam or not.

In recent times, the internet and social media have become the fastest and easiest ways to get information, and messages, reviews and opinions are now a significant source of it. Thanks to advances in technology, we can extract meaningful information from such data using artificial intelligence techniques. Natural language processing (NLP), a branch of data science, is one way to tackle problems such as SMS spam and to draw useful insights from text.


About the Dataset

The data can be downloaded from here. It contains 5573 rows and 2 columns. Each row holds a message and a label marking it as spam or ham (not spam).


Code Implementation

The code was written in Google Colab and downloaded as an .ipynb notebook file.

Install and import the packages

# Install packages (the leading "!" runs a shell command from a notebook cell)
!pip install wordcloud
%matplotlib inline
import matplotlib.pyplot as plt
import csv
import sklearn
import pickle
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV,train_test_split,StratifiedKFold,cross_val_score,learning_curve

Read the data

Import the dataset spam.csv. We then need to remove the unwanted columns.

data = pd.read_csv('data/spam.csv', encoding='latin-1')
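The listing above does not show how the unwanted columns are removed. A minimal sketch of one way to do it follows, assuming the raw file stores the label in a column named `v1`, the message in `v2`, and has extra empty `Unnamed` columns. Those column names and the helper `clean_columns` are assumptions for illustration; adjust them to match the actual file.

```python
import pandas as pd

# Toy frame mimicking the ASSUMED raw layout (v1 = label, v2 = message,
# plus empty "Unnamed" columns). The real data would come from pd.read_csv.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["See you at lunch", "WIN a free prize now", "Call me later"],
    "Unnamed: 2": [None, None, None],
    "Unnamed: 3": [None, None, None],
})

def clean_columns(df):
    """Keep only the label and message columns and give them readable names."""
    return df[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})

cleaned = clean_columns(raw)
print(cleaned.columns.tolist())
```

The rest of the article refers to the columns as `label` and `text`, which is why the sketch renames them that way.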

 The data has ham and spam messages labelled.

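The distribution of ham and spam messages looks like this:

Overall distribution of spam and ham messages

# Length of each message in characters (the 'length' column is not in the raw file)
data['length'] = data['text'].apply(len)
# Histograms of message length, one panel per label
data.hist(column='length', by='label', bins=100, figsize=(20,7))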
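Besides the length histograms, it can help to look at how many messages fall into each class. The sketch below uses a small made-up label series rather than the real data, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Made-up labels standing in for data['label']
labels = pd.Series(["ham", "ham", "ham", "spam", "ham", "spam"], name="label")

# Absolute number of messages per class
counts = labels.value_counts()
# Share of each class, rounded to two decimals
share = labels.value_counts(normalize=True).round(2)

print(counts)
print(share)
```

On the real dataset, the same two calls on `data['label']` would show how imbalanced the classes are.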

Creating a corpus of spam and ham messages

import nltk
# Tokenizer models; newer NLTK releases may also need nltk.download('punkt_tab')
nltk.download('punkt')

def build_corpus(messages):
    # Lowercase and tokenize each message, then join all tokens into one string
    return ' '.join(token for msg in messages for token in nltk.word_tokenize(msg.lower()))

# Creating a corpus of spam messages
spam_words = build_corpus(data[data['label'] == 'spam'].text)
# Creating a corpus of ham messages
ham_words = build_corpus(data[data['label'] == 'ham'].text)

Creating Spam and Ham word clouds

A word cloud is a data visualization technique for text data in which the size of each word indicates its frequency or importance. Here we create one word cloud for the spam messages and one for the ham messages.

spam_wordcloud = WordCloud(width=500, height=300).generate(spam_words)
ham_wordcloud = WordCloud(width=500, height=300).generate(ham_words)
#Spam Word cloud
plt.figure( figsize=(10,8), facecolor='w')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Spam Wordcloud

#Creating Ham wordcloud
plt.figure( figsize=(10,8), facecolor='g')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Ham Wordcloud
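Since word size in a word cloud tracks how often a word occurs, the raw counts can also be inspected directly. This is a rough sketch using `collections.Counter` on a made-up token string; `top_words` is a hypothetical helper, and the `WordCloud` class applies its own extra processing that this does not reproduce.

```python
from collections import Counter

def top_words(corpus, n=3):
    """Return the n most frequent whitespace-separated tokens with their counts."""
    return Counter(corpus.split()).most_common(n)

# Made-up stand-in for the spam_words string built earlier
sample_spam_words = "free prize call now free call free"
print(top_words(sample_spam_words, n=2))
```

Calling `top_words(spam_words)` and `top_words(ham_words)` on the real corpora would list the words that dominate each cloud.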

Data pre-processing of SMS Spam

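Remove punctuation and stopwords from the text data.

import string
nltk.download('stopwords')
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(text):
    # Strip punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Drop English stopwords (case-insensitive)
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)

data['text'] = data['text'].apply(text_process)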
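To see what this cleaning step does to a single message, here is a small self-contained sketch. It uses a tiny hand-picked stopword set (`SAMPLE_STOPWORDS`, an assumption for illustration) in place of NLTK's English list so that it runs without any downloads.

```python
import string

# Tiny illustrative stopword set; the article uses NLTK's English list instead
SAMPLE_STOPWORDS = {"you", "have", "a", "to", "the", "now"}

def strip_punct_and_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    """Remove punctuation, then drop stopwords (case-insensitive)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w.lower() not in stop_words)

cleaned_message = strip_punct_and_stopwords("Congrats! You have WON a prize, call now!!!")
print(cleaned_message)  # Congrats WON prize call
```

Note that the original casing is kept; only the stopword comparison is case-insensitive.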

Converting text to vectors

Next we convert the text to vectors so that a model can classify it. Two common techniques are bag of words and TF-IDF. Ideally the representation should not be overly sparse and should retain most of the linguistic information. A drawback of bag of words is that it relies on raw counts, so every word is treated as equally important no matter how common it is across messages. TF-IDF addresses this by assigning different weights to words, lowering the weight of words that appear in many messages.

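# Text to vector (bag of words)
# vocab and word2idx are not defined in the original listing;
# this is one plausible way to build them from the cleaned messages
vocab = sorted({word for text in data['text'] for word in text.split()})
vocab_size = len(vocab)
word2idx = {word: i for i, word in enumerate(vocab)}

def text_to_vector(text):
    # Count how often each vocabulary word occurs in one message
    word_vector = np.zeros(vocab_size, dtype=np.int_)
    for word in text.split():
        idx = word2idx.get(word)
        if idx is not None:
            word_vector[idx] += 1
    return word_vector

# Convert all messages to count vectors (a dense matrix, one row per message)
word_vectors = np.zeros((len(data), vocab_size), dtype=np.int_)
for i, text in enumerate(data['text']):
    word_vectors[i] = text_to_vector(text)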

#Converting words to vector using TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape
 
#features = word_vectors
features = vectors
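The difference between the two weighting schemes can be seen on a toy corpus. In the sketch below (three made-up messages), the word "call" appears in every message while "prize" appears in only one. Bag of words gives both the same count in the first message, whereas TF-IDF gives the rarer word the larger weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up messages: "call" occurs in all three, "prize" in only the first
docs = ["free prize call now", "call me now", "call you later"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# Values for the first message
call_count = counts[0, bow.vocabulary_["call"]]
prize_count = counts[0, bow.vocabulary_["prize"]]
call_weight = weights[0, tfidf.vocabulary_["call"]]
prize_weight = weights[0, tfidf.vocabulary_["prize"]]

print(call_count, prize_count)     # equal raw counts
print(call_weight < prize_weight)  # TF-IDF favours the rarer word
```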

Split the data using the sklearn library

Before applying the machine learning models, we split the features and labels into a training set and a test set. 85% of the data is used for training and 15% for testing.

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(features, data['label'], test_size=0.15, random_state=111)
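Because spam is usually the minority class, one optional variation is to pass `stratify` so that both splits keep the same ham-to-spam ratio. This is not what the article's split does; it is a sketch on made-up data showing the effect.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 4 spam and 16 ham messages (20% spam)
texts = [f"message {i}" for i in range(20)]
labels = ["spam"] * 4 + ["ham"] * 16

# stratify=labels keeps the 20% spam share in both the train and test split
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=111, stratify=labels
)

print(Counter(train_labels), Counter(test_labels))
```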

Training using multiple machine learning models

#Training multiple machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)
clfs = {'SVC' : svc,'KN' : knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc}
def train(clf, features, targets):    
    clf.fit(features, targets)
def predict(clf, features):
    return (clf.predict(features))
pred_scores_word_vectors = []
for k,v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test , pred)]))

Score for all the machine learning models

pred_scores_word_vectors
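The list of `(name, [accuracy])` pairs is easier to compare as a sorted table. The sketch below uses placeholder model names and placeholder numbers, not the article's results, and `scores_table` is a hypothetical helper.

```python
import pandas as pd

# Placeholder values in the same (name, [accuracy]) shape the training loop produces
example_scores = [("model_a", [0.5]), ("model_b", [0.9]), ("model_c", [0.7])]

def scores_table(pred_scores):
    """Turn (name, [accuracy]) pairs into a table sorted best-first."""
    table = pd.DataFrame(
        [(name, score[0]) for name, score in pred_scores],
        columns=["model", "accuracy"],
    )
    return table.sort_values("accuracy", ascending=False).reset_index(drop=True)

print(scores_table(example_scores))
```

Passing the real score list to `scores_table` would rank the six classifiers in the same way.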


Model Prediction

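def find(x):
    # The labels in this dataset are the strings 'spam' and 'ham'
    if x == 'spam':
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")
text = ["Free tones Hope you enjoyed your new content"]
# Apply the same cleaning used for training before vectorizing
integers = vectorizer.transform([text_process(t) for t in text])
x = mnb.predict(integers)[0]
find(x)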
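The vectorizer and the classifier always have to be applied together at prediction time, so one possible refinement is to bundle them with scikit-learn's `make_pipeline`. The sketch below trains on six made-up messages, so its predictions are only illustrative. The names `spam_pipeline`, `train_texts` and `train_labels` are assumptions and not part of the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages, three per class
train_texts = [
    "free prize call now", "win free tones today", "free entry win cash",
    "see you at lunch", "call me when you are home", "are we meeting later",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline vectorizes raw text and feeds it to Naive Bayes in one step
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.2))
spam_pipeline.fit(train_texts, train_labels)

prediction = spam_pipeline.predict(["free cash prize"])[0]
print(prediction)
```

With a pipeline, new messages can be passed in as raw strings and there is no separate `vectorizer.transform` call to forget.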

Final Thoughts

We used various machine learning algorithms to classify text messages and compared their accuracy. The Naive Bayes classifier gave the best result of all, with an accuracy of over 98%. This article provides an overview of different techniques for classifying a text message as “spam” or “not spam”. As a next step, we can explore deep learning models such as LSTM and Bi-LSTM to try to get a better result.

The complete code can be found in AIM’s GitHub repository.

Ankit Das
A data analyst with expertise in statistical analysis, data visualization ready to serve the industry using various analytical platforms. I look forward to having in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.
