
Hands-On Guide To Detecting SMS Spam Using Natural Language Processing


In recent times, the internet and social media have become the fastest and easiest ways to get information, and messages, reviews and opinions have become a significant source of it. Short Message Service (SMS) is considered one of the most powerful means of communication. As dependence on mobile devices has increased drastically over time, so has the number of attacks in the form of SMS spam. Thanks to advances in technology, we can now extract meaningful information from such data using various artificial intelligence techniques. Natural language processing (NLP), a part of data science, is one way to tackle this problem and gain valuable insights.

The main aim of this article is to understand how to build an SMS spam detection model. We will build a binary classification model that detects whether a text message is spam or not.

About the Dataset

The data can be downloaded from here. It contains 5573 rows and 2 columns. Each row holds the text of a message and a label marking it as spam or ham (not spam).

Code Implementation

The code was written in Google Colab and downloaded as an .ipynb notebook file.

Install all the packages

#Install Packages
%pip install wordcloud
%matplotlib inline
import matplotlib.pyplot as plt
import csv
import sklearn
import pickle
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV,train_test_split,StratifiedKFold,cross_val_score,learning_curve

Read the data

Import the dataset spam.csv. Any unwanted columns need to be removed so that only the label and the message text remain.

data = pd.read_csv('data/spam.csv', encoding='latin-1')
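The later snippets refer to columns named label, text and length, which the loading step shown here does not create. A minimal sketch of that cleanup follows, assuming the raw file names its columns v1 and v2 and carries an empty "Unnamed" column (an assumption about the file layout, not something the article states). A small invented in-memory frame stands in for spam.csv so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for pd.read_csv('data/spam.csv', encoding='latin-1').
# The column names v1, v2 and "Unnamed: 2" are assumptions about the raw file,
# and the two messages are invented for illustration.
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at lunch", "WINNER! Claim your free prize now"],
    "Unnamed: 2": [None, None],
})

# Keep only the label and message columns and give them descriptive names
cleaned = raw[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})

# Message length in characters, the quantity plotted in the length histogram
cleaned["length"] = cleaned["text"].str.len()

print(cleaned)
```

If the real file uses different column names, only the list passed to the selection and the rename mapping would need to change.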

The data has ham and spam messages labelled.

The distribution of ham and spam messages looks like this:

Overall distribution of spam and ham messages

# Length distribution of spam and ham messages
# The 'length' column is not created in the code shown, so it is derived here
data['length'] = data['text'].apply(len)
data.hist(column='length', by='label', bins=100, figsize=(20,7))

Creating a corpus of spam and ham messages

import nltk
nltk.download('punkt')  # tokenizer models; newer NLTK releases may also need 'punkt_tab'

def build_corpus(messages):
    # Lowercase and tokenize every message, then join all tokens into one string
    tokens = []
    for message in messages:
        tokens.extend(nltk.word_tokenize(message.lower()))
    return ' '.join(tokens)

# Creating a corpus of spam messages
spam_words = build_corpus(data[data['label'] == 'spam'].text)
# Creating a corpus of ham messages
ham_words = build_corpus(data[data['label'] == 'ham'].text)

Creating Spam and Ham word clouds

A word cloud is a data visualization technique for text data in which the size of each word indicates its frequency or importance. We create one word cloud for the spam messages and one for the ham messages.

spam_wordcloud = WordCloud(width=500, height=300).generate(spam_words)
ham_wordcloud = WordCloud(width=500, height=300).generate(ham_words)
#Spam Word cloud
plt.figure( figsize=(10,8), facecolor='w')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Spam Wordcloud

#Creating Ham wordcloud
plt.figure( figsize=(10,8), facecolor='g')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Ham Wordcloud
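Since word size in the cloud tracks frequency, the same information can also be read off as plain counts. The sketch below uses Python's collections.Counter on a short invented string; toy_corpus is a hypothetical stand-in for the spam_words corpus, not data from the article.

```python
from collections import Counter

# Invented stand-in for the spam_words corpus string built earlier
toy_corpus = "free prize call now free entry call free"

# Count how often each whitespace-separated token occurs
word_counts = Counter(toy_corpus.split())

# The most frequent tokens are the ones a word cloud would draw largest
top_words = word_counts.most_common(2)
print(top_words)
```

Running the same two lines on spam_words and ham_words gives a quick numeric check on what the two clouds show.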

Data pre-processing of SMS Spam

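Remove punctuation and stopwords from the text data.

import string
nltk.download('stopwords')  # the stopword list has to be downloaded once
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

def text_process(text):
    # Strip punctuation, then drop English stopwords
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)

data['text'] = data['text'].apply(text_process)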
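To see what this cleaning step does to a single message, here is a self-contained sketch. It uses a tiny hand-picked stopword set (SAMPLE_STOPWORDS, hypothetical) instead of NLTK's English list so that it runs without any downloads; the helper name strip_punct_and_stopwords and the sample message are likewise illustrative.

```python
import string

# Hypothetical mini stopword list; the article uses stopwords.words('english') from NLTK
SAMPLE_STOPWORDS = {"you", "have", "a", "to", "the"}

def strip_punct_and_stopwords(text, stop_words):
    # Delete every punctuation character
    no_punct = text.translate(str.maketrans('', '', string.punctuation))
    # Keep only the words whose lowercase form is not a stopword
    kept = [word for word in no_punct.split() if word.lower() not in stop_words]
    return " ".join(kept)

sample = "Congratulations! You have won a FREE ticket to the finals."
cleaned_sample = strip_punct_and_stopwords(sample, SAMPLE_STOPWORDS)
print(cleaned_sample)
```

Note that the original casing of the surviving words is kept; only the stopword comparison is case-insensitive.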

Converting text to vectors

Next, we convert the text to vectors so that the model can classify it. Two common techniques are Bag of Words and TF-IDF. Ideally, the representation should not be overly sparse and should retain most of the linguistic information. The problem with Bag of Words is that it weights every word by its raw count alone, so a frequent but uninformative word counts as much as a distinctive one. TF-IDF addresses this by giving each word a weight that also reflects how rare it is across messages.
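The difference between the two weightings can be seen on a toy example. The sketch below is an illustration on three invented messages, using scikit-learn's CountVectorizer for Bag of Words and TfidfVectorizer for TF-IDF; the variable names are hypothetical. The word "call" appears in every message, while "prize" appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented messages: "call" occurs in all three, "prize" in only the first
toy_messages = [
    "free prize call now",
    "call me now",
    "call you later",
]

# Bag of Words: raw counts per message
bow = CountVectorizer()
bow_row = bow.fit_transform(toy_messages)[0].toarray().ravel()

# TF-IDF: counts rescaled by how rare each word is across messages
tfidf = TfidfVectorizer()
tfidf_row = tfidf.fit_transform(toy_messages)[0].toarray().ravel()

bow_call = bow_row[bow.vocabulary_["call"]]
bow_prize = bow_row[bow.vocabulary_["prize"]]
tfidf_call = tfidf_row[tfidf.vocabulary_["call"]]
tfidf_prize = tfidf_row[tfidf.vocabulary_["prize"]]

print("Bag of Words:", bow_call, bow_prize)
print("TF-IDF:", round(tfidf_call, 3), round(tfidf_prize, 3))
```

In the first message both words get the same Bag of Words count, whereas TF-IDF gives the rarer word "prize" a larger weight than the ubiquitous "call".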

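# Text to Vector (Bag of Words)
# Assumption: the code shown never defines vocab, vocab_size or word2idx,
# so they are built here from the cleaned messages
vocab = sorted({word for message in data['text'] for word in message.split()})
vocab_size = len(vocab)
word2idx = {word: idx for idx, word in enumerate(vocab)}

def text_to_vector(text):
    # Count how often each vocabulary word occurs in the message
    word_vector = np.zeros(vocab_size, dtype=np.int_)
    for word in text.split():
        idx = word2idx.get(word)
        if idx is not None:
            word_vector[idx] += 1
    return word_vector

# Convert all messages to vectors
word_vectors = np.zeros((len(data), vocab_size), dtype=np.int_)
for i, message in enumerate(data['text']):
    word_vectors[i] = text_to_vector(message)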

#Converting words to vector using TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape
 
#features = word_vectors
features = vectors

Split the data using sklearn library

We split the data into a training set and a test set before applying machine learning models to it. 85% of the data is used for training and 15% for testing.

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(features, data['label'], test_size=0.15, random_state=111)
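As a quick sanity check on what an 85/15 split produces, the sketch below runs train_test_split on invented toy data. The stratify argument is an optional addition that the article's call does not use; it keeps the share of spam similar in both parts, which can matter when one class is much rarer than the other. All variable names and values here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented toy data: 100 samples, 13 of them labelled spam
toy_features = np.arange(200).reshape(100, 2)
toy_labels = np.array(["ham"] * 87 + ["spam"] * 13)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels,
    test_size=0.15, random_state=111,
    stratify=toy_labels,  # optional; not part of the article's call
)

# Sizes of the two parts and the share of spam in each
print(X_tr.shape, X_te.shape)
print((y_tr == "spam").mean(), (y_te == "spam").mean())
```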

Training using multiple machine learning models

#Training multiple machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)
clfs = {'SVC' : svc,'KN' : knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc}
def train(clf, features, targets):    
    clf.fit(features, targets)
def predict(clf, features):
    return (clf.predict(features))
pred_scores_word_vectors = []
for k,v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test , pred)]))

Score for all the machine learning models

pred_scores_word_vectors
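The score list is a list of (name, [accuracy]) pairs, which is easier to compare as a sorted table. The sketch below shows one way to build such a table with pandas. It trains two of the classifiers on a handful of invented messages so that it runs on its own; the resulting numbers describe only this toy data and are not the article's results.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Invented toy data, standing in for the real train/test split
train_texts = ["win a free prize now", "free cash call now",
               "are we meeting today", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free prize call", "lunch today"]
test_labels = ["spam", "ham"]

toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(train_texts)
X_toy_test = toy_vectorizer.transform(test_texts)

toy_clfs = {"NB": MultinomialNB(alpha=0.2),
            "DT": DecisionTreeClassifier(random_state=111)}

# Same (name, [accuracy]) structure as pred_scores_word_vectors
toy_scores = []
for name, clf in toy_clfs.items():
    clf.fit(X_toy_train, train_labels)
    toy_scores.append((name, [accuracy_score(test_labels, clf.predict(X_toy_test))]))

# One row per model, sorted from highest to lowest accuracy
score_table = pd.DataFrame(
    [(name, scores[0]) for name, scores in toy_scores],
    columns=["model", "accuracy"],
).sort_values("accuracy", ascending=False).reset_index(drop=True)

print(score_table)
```

The same DataFrame construction should work unchanged on pred_scores_word_vectors.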


Model Prediction

def find(x):
    # The classifiers were trained on data['label'], so predictions are the strings 'spam' / 'ham'
    if x == 'spam':
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

text = ["Free tones Hope you enjoyed your new content"]
integers = vectorizer.transform(text)
x = mnb.predict(integers)[0]
find(x)
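Because every new message has to pass through the same vectorizer before it reaches the classifier, the two steps can also be bundled into a single object. The sketch below is one possible arrangement using scikit-learn's make_pipeline, trained on a few invented messages; it is an alternative packaging of the same TF-IDF and Naive Bayes steps, not the code the article ran.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training data, standing in for the real messages and labels
toy_texts = ["win a free prize now", "free cash call now",
             "are we meeting today", "see you at lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw text and then classifies it
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.2))
spam_pipeline.fit(toy_texts, toy_labels)

# Raw strings go straight in; no separate transform call is needed
prediction = spam_pipeline.predict(["free prize waiting"])[0]
print(prediction)
```

With this arrangement there is no risk of forgetting the transform step or of using a vectorizer that was fitted on different data.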

Final Thoughts

We used several machine learning algorithms to classify text messages and compared their accuracy. The Naive Bayes classifier gave the best result of all, with an accuracy of over 98%. This article provides an overview of different techniques for classifying a text message as “spam” or “not spam”. As a next step, deep learning models such as LSTM and Bi-LSTM could be explored to see whether they give better results.

The complete code can be found in AIM’s GitHub repository. Please visit this link to find the code.


Ankit Das

A data analyst with expertise in statistical analysis, data visualization ready to serve the industry using various analytical platforms. I look forward to having in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.