The internet and social media have become the fastest and easiest ways to get information, and messages, reviews and opinions are now a significant source of it. SMS (Short Message Service) remains one of the most widely used means of communication, and as our dependence on mobile devices has grown, so has the number of attacks in the form of SMS spam. Thanks to advances in artificial intelligence, we can now extract meaningful information from such data; in particular, Natural Language Processing (NLP) techniques can give us valuable insights and help detect spam automatically.
The main aim of this article is to show how to build an SMS spam detection model. We will build a binary classification model that detects whether a text message is spam or not.
About the Dataset
The data can be downloaded from here. It contains 5573 rows and 2 columns; each row holds a message and a label indicating whether the message is spam or ham (not spam).
Code Implementation
The code is implemented in Google Colab, and the notebook is available as an .ipynb file.
Install and import all the packages
# Install packages
!pip install wordcloud

%matplotlib inline
import matplotlib.pyplot as plt
import csv
import sklearn
import pickle
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold, cross_val_score, learning_curve
Read the data
Import the dataset spam.csv. We also need to remove the unwanted columns, as shown in the cleanup sketch below.
data = pd.read_csv('data/spam.csv', encoding='latin-1')
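The Kaggle version of spam.csv typically stores the label and message in columns named v1 and v2, plus three empty "Unnamed" columns. Here is a minimal cleanup sketch, assuming that layout (the column names are an assumption, not shown in the original article); the length column it adds is used by the histogram later on.

# Drop the empty columns and rename the remaining ones
# (column names assume the standard Kaggle spam.csv layout)
data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
data = data.rename(columns={'v1': 'label', 'v2': 'text'})

# Add a message-length column, used for the histogram below
data['length'] = data['text'].apply(len)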
The data has each message labelled as ham or spam. The distribution of the two classes looks like this:
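A quick way to inspect this distribution is with pandas value_counts; the sketch below is an assumption for illustration, not the article's original figure code.

# Count and plot how many messages are ham vs. spam
label_counts = data['label'].value_counts()
print(label_counts)
label_counts.plot(kind='bar')
plt.title('Distribution of ham and spam messages')
plt.show()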
Overall distribution of spam and ham messages
# Histogram of message length for spam and ham messages
data.hist(column='length', by='label', bins=100, figsize=(20,7))
Creating a corpus of spam and ham messages
ham_words = ''
spam_words = ''
nltk.download('punkt')

# Creating a corpus of spam messages
for val in data[data['label'] == 'spam'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        spam_words = spam_words + words + ' '

# Creating a corpus of ham messages
for val in data[data['label'] == 'ham'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        ham_words = ham_words + words + ' '
Creating Spam and Ham word clouds
Next, we create word clouds of the spam and ham messages. A word cloud is a data visualization technique for text data in which the size of each word indicates its frequency or importance.
spam_wordcloud = WordCloud(width=500, height=300).generate(spam_words)
ham_wordcloud = WordCloud(width=500, height=300).generate(ham_words)
# Spam word cloud
plt.figure(figsize=(10,8), facecolor='w')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Spam Wordcloud
# Ham word cloud
plt.figure(figsize=(10,8), facecolor='g')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Ham Wordcloud
Data pre-processing of SMS Spam
We remove punctuation and stopwords from the text data.
import string

nltk.download('stopwords')

def text_process(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove English stopwords
    text = [word for word in text.split() if word.lower() not in stopwords.words('english')]
    return " ".join(text)

data['text'] = data['text'].apply(text_process)
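As a quick sanity check, here is what text_process does to a made-up message; the example string and its output are illustrative assumptions.

# Hypothetical example: punctuation and English stopwords are stripped
print(text_process("Free entry!! Win a prize now..."))
# Expected output (approximately): "Free entry Win prize"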
Converting text to vectors
Now we will convert the text to vectors so the model can classify it easily. Two common techniques are Bag of Words and the TF-IDF vectorizer. Ideally, the representation should not produce an overly sparse matrix and should retain most of the linguistic information. The problem with Bag of Words is that it assigns the same importance (weight) to every word; TF-IDF resolves this by assigning different weights to different words.
# Bag of Words: build a vocabulary and a word-to-index mapping
vocab = sorted(set(" ".join(data['text']).split()))
vocab_size = len(vocab)
word2idx = {word: i for i, word in enumerate(vocab)}

def text_to_vector(text):
    # Count how often each vocabulary word appears in the message
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        if word2idx.get(word) is None:
            continue
        word_vector[word2idx.get(word)] += 1
    return word_vector

# Convert all messages to count vectors
word_vectors = np.zeros((len(data), vocab_size), dtype=np.int_)
for i, text_ in enumerate(data['text']):
    word_vectors[i] = text_to_vector(text_)
# Converting words to vectors using the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape

# features = word_vectors
features = vectors
Split the data using sklearn library
Next, we split the data into training and testing sets and apply machine learning models to it: 85% of the data is used for training and 15% for testing.
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(features, data['label'], test_size=0.15, random_state=111)
Training using multiple machine learning models
# Training multiple machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)

clfs = {'SVC': svc, 'KN': knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc}

def train(clf, features, targets):
    clf.fit(features, targets)

def predict(clf, features):
    return clf.predict(features)

pred_scores_word_vectors = []
for k, v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test, pred)]))
Score for all the machine learning models
pred_scores_word_vectors
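Since pred_scores_word_vectors is a list of (model name, [accuracy]) tuples, printing it directly works but is hard to read. A small formatting sketch (the loop below is an assumption, not part of the original code):

# Print each model's accuracy in a readable form
for name, score in pred_scores_word_vectors:
    print(f"{name}: {score[0]:.4f}")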
Model Prediction
# mnb.predict returns the string labels 'spam'/'ham', so compare against them
def find(x):
    if x == 'spam':
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

text = ["Free tones Hope you enjoyed your new content"]
integers = vectorizer.transform(text)
x = mnb.predict(integers)[0]
find(x)
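For comparison, the same pipeline can be run on a hypothetical ham-like message (the text below is made up for illustration):

# Hypothetical non-spam message for comparison
ham_text = ["Are we still meeting for lunch today?"]
find(mnb.predict(vectorizer.transform(ham_text))[0])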

Final Thoughts
We used various machine learning algorithms to classify the text messages and compared accuracy across these models. The Naive Bayes classifier gives the best result of all, with an accuracy of over 98%. This article provides an overview of using different techniques to classify a text message as spam or not. Going further, we could explore deep learning models such as LSTM and Bi-LSTM to get better results.
The complete code can be found in AIM's GitHub repository; please visit this link to find it.