Hands-On Guide To Detecting SMS Spam Using Natural Language Processing


In recent times, the internet and social media have become the fastest and easiest ways to get information, and messages, reviews and opinions have become a significant source of it. Short Message Service (SMS) is considered one of the most powerful means of communication. As dependence on mobile devices has increased drastically over time, so has the number of attacks in the form of SMS spam. Thanks to advances in technology, we can now extract meaningful information from such data using various artificial intelligence techniques. Natural Language Processing (NLP), a part of data science, is used to deal with such problems and give valuable insights.

The main aim of this article is to understand how to build an SMS spam detection model. We will build a binary classification model to detect whether a text message is spam or not.

About the Dataset

The data can be downloaded from here. It contains 5573 rows and 2 columns. Each row holds a text message and a label marking it as spam or ham (not spam).

Code Implementation

The code is implemented in Google Colab and the .ipynb file is downloaded.

Install all the packages

# Install packages (run in a notebook cell)
!pip install wordcloud
%matplotlib inline
import matplotlib.pyplot as plt
import csv
import sklearn
import pickle
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV,train_test_split,StratifiedKFold,cross_val_score,learning_curve

Read the data

Import the dataset spam.csv. We need to remove the unwanted columns.

data = pd.read_csv('data/spam.csv', encoding='latin-1')
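
The column clean-up mentioned above is not shown in the listing. A minimal sketch of how it might look follows, assuming the file has the label in a column named v1, the message in v2, and three empty "Unnamed" columns. These column names are an assumption about the file layout, and a small in-memory frame stands in for spam.csv.

```python
import pandas as pd

# Tiny in-memory stand-in for spam.csv. The column names v1, v2 and
# "Unnamed: N" are an assumption about the file layout, not taken from the article.
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at lunch", "WINNER! Claim your free prize now"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

# Keep only the label and message columns, then give them the names
# used in the rest of the code ('label' and 'text').
data = raw[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
print(data)
```

After this step the frame exposes data['label'] and data['text'], which is what the later snippets rely on.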

The data contains messages labelled as ham or spam.

[Figure: Overall distribution of spam and ham messages]
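
The class distribution can also be checked numerically. The sketch below uses a toy label column in place of data['label'], so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy labels standing in for data['label']; the real counts come from spam.csv.
labels = pd.Series(["ham", "ham", "spam", "ham"], name="label")

counts = labels.value_counts()                # absolute number of messages per class
shares = labels.value_counts(normalize=True)  # fraction of messages per class
print(counts)
print(shares.round(2))
```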

# Length of spam and ham messages (in characters)
data['length'] = data['text'].apply(len)
data.hist(column='length', by='label', bins=100, figsize=(20,7))

Creating a corpus of spam and ham messages

ham_words = ''
spam_words = ''
import nltk
nltk.download('punkt')
# Creating a corpus of spam messages
for val in data[data['label'] == 'spam'].text:
      text = val.lower()
      tokens = nltk.word_tokenize(text)
      for words in tokens:
          spam_words = spam_words + words + ' '
# Creating a corpus of ham messages        
for val in data[data['label'] == 'ham'].text:
      text = val.lower()
      tokens = nltk.word_tokenize(text)
      for words in tokens:
          ham_words = ham_words + words + ' '

Creating Spam and Ham word clouds

Creating word clouds of the spam and ham messages. A word cloud is a data visualization technique for text data in which the size of each word indicates its frequency or importance.

spam_wordcloud = WordCloud(width=500, height=300).generate(spam_words)
ham_wordcloud = WordCloud(width=500, height=300).generate(ham_words)
# Spam word cloud
plt.figure(figsize=(10,8), facecolor='w')
plt.imshow(spam_wordcloud)
plt.axis('off')
plt.show()

[Figure: Spam word cloud]

# Ham word cloud
plt.figure(figsize=(10,8), facecolor='g')
plt.imshow(ham_wordcloud)
plt.axis('off')
plt.show()

[Figure: Ham word cloud]
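
To see roughly what drives the word sizes, we can count token frequencies directly. This is only a sketch on a toy string built the same way as spam_words. The WordCloud class applies its own filtering, so its counts can differ from these.

```python
from collections import Counter

# Toy corpus string, built like spam_words: lower-cased tokens joined by spaces.
toy_words = "free prize call now free entry call free"

# Count how often each token occurs; a word cloud draws frequent words larger.
frequencies = Counter(toy_words.split())
print(frequencies.most_common(3))
```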

Data pre-processing of SMS Spam

Removing punctuation and stopwords from the text data.

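import string
nltk.download('stopwords')
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(text):
    # Strip punctuation, then drop English stopwords
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)

data['text'] = data['text'].apply(text_process)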

Converting text to vectors

Next, we convert the text to vectors so that the model can classify it. Two common techniques are Bag of Words and the TF-IDF Vectorizer. Ideally, the representation should not be excessively sparse and should retain most of the linguistic information. The problem with Bag of Words is that it weights words by raw count only, so a word that appears in almost every message counts as much as a rare, informative one. TF-IDF addresses this by assigning different weights to the words, giving less weight to words that occur in many messages.

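# Text to vector (Bag of Words)
# Assumed definitions: vocabulary and word -> index lookup built from the processed messages
vocab = sorted({word for message in data['text'] for word in message.split()})
vocab_size = len(vocab)
word2idx = {word: i for i, word in enumerate(vocab)}

def text_to_vector(text):
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        # Count only words that are in the vocabulary
        if word2idx.get(word) is not None:
            word_vector[word2idx.get(word)] += 1
    return np.array(word_vector)

# Convert all messages to vectors
word_vectors = np.zeros((len(data), vocab_size), dtype=np.int_)
for i, message in enumerate(data['text']):
    word_vectors[i] = text_to_vector(message)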

#Converting words to vector using TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
#features = word_vectors
features = vectors
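
The difference in weighting between Bag of Words and TF-IDF can be seen on a toy corpus. In the sketch below, which uses made-up messages and scikit-learn's default settings, the word "call" occurs in every message and "prize" in only one. Both get a count of 1 in the first message, but TF-IDF gives "prize" the larger weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus: "call" occurs in every message, "prize" in only one.
docs = ["free prize call now", "call me later", "free entry call now"]

# Bag of Words: raw counts per message.
bow = CountVectorizer()
counts = bow.fit_transform(docs)

# TF-IDF: counts rescaled so that words found in many messages weigh less.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

call_count = counts[0, bow.vocabulary_["call"]]
prize_count = counts[0, bow.vocabulary_["prize"]]
call_weight = weights[0, tfidf.vocabulary_["call"]]
prize_weight = weights[0, tfidf.vocabulary_["prize"]]

print("counts  (call, prize):", call_count, prize_count)
print("weights (call, prize):", round(call_weight, 3), round(prize_weight, 3))
```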

Split the data using sklearn library

We split the data into training and testing sets before applying machine learning models to it. 85% of the data is used for training and 15% for testing.

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(features, data['label'], test_size=0.15, random_state=111)
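
The effect of test_size=0.15 can be checked on toy data. The sketch below also passes stratify=y, an optional extra that the split above does not use. It keeps the spam/ham ratio the same in both parts, which may help when one class is much rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 100 messages, 20 of them spam (an imbalanced label column).
X = np.arange(100).reshape(-1, 1)
y = np.array(["spam"] * 20 + ["ham"] * 80)

# stratify=y is an optional addition, not part of the article's split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=111, stratify=y
)

print("train size:", X_train.shape[0], "test size:", X_test.shape[0])
print("spam share in test set:", (y_test == "spam").mean())
```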

Training using multiple machine learning models

#Training multiple machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)
clfs = {'SVC' : svc,'KN' : knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc}
def train(clf, features, targets):
    clf.fit(features, targets)
def predict(clf, features):
    return (clf.predict(features))
pred_scores_word_vectors = []
for k,v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test , pred)]))

Score for all the machine learning models
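
One way to tabulate such scores is to load the (name, [score]) pairs into a DataFrame. The sketch below trains two of the classifiers on a toy corpus, so the accuracies it prints say nothing about the real dataset. The names toy_clfs and score_table are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration only.
train_texts = ["free prize call now", "win cash today", "see you at lunch", "are you home yet"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free cash prize", "lunch at home"]
test_labels = ["spam", "ham"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

toy_clfs = {"NB": MultinomialNB(alpha=0.2), "LR": LogisticRegression(solver="liblinear")}

# Collect scores in the same (name, [score]) shape as pred_scores_word_vectors.
scores = []
for name, clf in toy_clfs.items():
    clf.fit(X_tr, train_labels)
    scores.append((name, [accuracy_score(test_labels, clf.predict(X_te))]))

# One row per model, one column for the accuracy.
score_table = pd.DataFrame.from_dict(dict(scores), orient="index", columns=["Accuracy"])
print(score_table.sort_values("Accuracy", ascending=False))
```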



Model Prediction

def find(x):
    # The label may be the string 'spam' or, if encoded numerically, 1
    if x == 'spam' or x == 1:
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

text = ["Free tones Hope you enjoyed your new content"]
integers = vectorizer.transform(text)
x = mnb.predict(integers)[0]
find(x)

Final Thoughts

We used various machine learning algorithms to classify the text messages and compared accuracy across these models. The Naive Bayes classifier gives the best result among them, with an accuracy of over 98%. This article provides an overview of using different techniques to classify a text message as “spam” or “not”. Further, we can explore deep learning models such as LSTM and Bi-LSTM to get a better result.

The complete code can be found in AIM’s GitHub repository.

Copyright Analytics India Magazine Pvt Ltd