
Guide To Text Classification using TextCNN


Nowadays, many tasks are performed using text classification, like hate speech detection and sentiment classification. This article’s main focus is to perform text classification and sentiment analysis on three combined datasets, the Amazon review, IMDB movie rating and Yelp review datasets, using TextCNN. Before going to the coding, let’s cover some basics of text classification and convolutional neural networks.

Introduction to Text Classification 

What is Text Classification?

Text classification is the process of assigning labels to a set of texts or words, in a one/zero or predefined-label format, and those labels tell us about the sentiment of the set of words.

First of all, human language is nothing but a combination of words. Whenever a human speaks, the words come out with a sentiment that another human can easily understand: we easily grasp whether a sentence carries anger or some other mood. Making a machine understand human language in this way is the goal of text classification.

To perform text classification, we need already classified data; the data used in this article comes with labels. So here we are, trying to build a model on three datasets; as mentioned before, each dataset has sentences labelled 0 or 1. Once built, the model will try to classify sentences according to their sentiment.

This model will take text as input, analyze the information in it, and assign a label to it.

Let’s first make a simple logistic regression model to understand this better.

In this article, I am using Google Colab.

First of all, we will import the data using pandas.

Input:

 import pandas as pd
 filepath_dire = {'yelp':   '/content/drive/MyDrive/Yugesh/TextCNN/yelp_labelled.txt',
                  'amazon': '/content/drive/MyDrive/Yugesh/TextCNN/amazon_cells_labelled.txt',
                  'imdb':   '/content/drive/MyDrive/Yugesh/TextCNN/imdb_labelled.txt'}
 data_list = []
 for source, filepath in filepath_dire.items():
     data = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
     data['source'] = source  # Add another column filled with the source name
     data_list.append(data)
 data = pd.concat(data_list)
 print(data) 

Output:

It seems like we have imported every dataset correctly. So let’s move towards model building, but we need to perform some preprocessing before fitting a model to the dataset.

Let’s think about how the model works: how, internally, will it calculate the label from the text? The internals depend on mathematical evaluation and calculation, so here too the data, the model’s input, needs to be in a matrix or numeric format that the model can easily compute with.

The transformation of words into numbers can be done in many ways. One of them is to count the number of occurrences of every word in each sentence, over the entire set of words in the dataset; in NLP, such a collection of texts is called a corpus.

Another method is to build a vocabulary where every word has its own index number. More formally, we map every word to the index tied to it.

This can easily be done using the CountVectorizer provided by the scikit-learn library. Let’s look at an example.

Input :

example = ['analytics india magazine is good magazine ', 'analytics india magazine provides good information']

Next, we can vectorize the sentences using CountVectorizer.

Input:

 from sklearn.feature_extraction.text import CountVectorizer
 examplevectorizer = CountVectorizer()
 examplevectorizer.fit(example)
 examplevectorizer.vocabulary_ 

Output:

 {'analytics': 0,
  'good': 1,
  'india': 2,
  'information': 3,
  'is': 4,
  'magazine': 5,
  'provides': 6} 

This resulting vocabulary defines the feature vector: each word has its own position in it, so each sentence can be represented in numeric terms.

The output of the example is a vocabulary with a unique index assigned to each word. The sentences can then be converted into arrays of word occurrences, which lets us count the frequency of each word in the dataset.

Input:

examplevectorizer.transform(example).toarray()

Output:

 array([[1, 1, 1, 0, 1, 2, 0],
        [1, 1, 1, 1, 0, 1, 1]])
 

Let’s apply this to our dataset.

First, we need to split our data into train and test.

Input:

 from sklearn.model_selection import train_test_split
 review = data['sentence'].values
 label = data['label'].values
 review_train, review_test, label_train, label_test = train_test_split(
    review, label, test_size=0.25, random_state=1000) 

Vectorizing the split data.

Input:

 from sklearn.feature_extraction.text import CountVectorizer
 review_vectorizer = CountVectorizer()
 review_vectorizer.fit(review_train)
 Xlr_train = review_vectorizer.transform(review_train)
 Xlr_test  = review_vectorizer.transform(review_test)
 Xlr_train 

Output:

Here we can see that the training matrix has 750 feature vectors, each with 1714 dimensions, which is the size of the vocabulary.
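To verify these numbers on your own run (a quick check; the exact counts depend on the data and split), the sparse matrices expose their shape directly:

 print(Xlr_train.shape)  # (number of training sentences, vocabulary size)
 print(Xlr_test.shape)   # the test matrix uses the same vocabulary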

Now the data is almost prepared for fitting the model; let’s build a logistic regression model and fit it to the data.

Input:

 from sklearn.linear_model import LogisticRegression
 LRmodel = LogisticRegression()
 LRmodel.fit(Xlr_train, label_train)
 score = LRmodel.score(Xlr_test, label_test)
 print("Accuracy:", score) 

Output:

Accuracy: 0.8195050946142649

Here we have seen a text classification model at a very basic level. There are many methods to perform text classification; TextCNN is one that applies neural networks to the task. First, let’s look at CNNs; after that, we will use one for text classification.

Introduction to CNN 

Convolutional neural networks, or CNNs, are among the most promising methods for developing machine learning models. For example, they perform very well in image classification and computer vision.

What are Convolutional neural networks or CNN?

A CNN is just a kind of neural network; what differs from other neural networks is its convolutional layer. To perform image classification, a CNN goes through every corner, vector and dimension of the pixel matrix. Working over all the features of a matrix in this way makes CNNs well suited to data in matrix form.

Convolutional layers learn filters that detect features such as edges, corners and textures, which makes them the special tool of a CNN. A filter slides across the image matrix and can detect the features it encodes, and each successive convolutional layer in the network can detect more complex features. As the features grow more complex, we need to expand the dimensions of the convolutional layers.

We can consider text data as sequential data, like time-series data: a one-dimensional matrix. So we need to work with a one-dimensional convolutional layer. The idea of the model is almost the same; only the data type and the dimension of the convolutional layers change. To work with TextCNN, we require a word embedding layer and a one-dimensional convolutional network.
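To make the sliding-window idea concrete, here is a minimal NumPy sketch of what a one-dimensional convolution computes. The toy signal and filter below are invented for illustration, but Conv1D in Keras does the same thing over sequences of word vectors, with filters learned during training.

 import numpy as np

 # A toy 1-D convolution (illustration only, not part of the article's model):
 # the filter is dotted with every length-3 window of the sequence, just as
 # Conv1D slides learned filters over sequences of word-embedding vectors.
 sequence = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # a made-up 1-D signal
 kernel = np.array([0.5, 1.0, 0.5])              # a made-up filter of width 3
 outputs = [sequence[i:i + 3] @ kernel for i in range(len(sequence) - 2)]
 print(outputs)  # [4.0, 6.0, 8.0] -> one value per window position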

Word Embedding 

What is Word Embedding?

Word embedding represents words as dense vectors, unlike the sparse count vectors we built with the CountVectorizer. It is a different way to preprocess the data. An embedding can map semantically similar words close to each other: it does not treat the text as a human language but maps the structure of the sets of words used in the corpus. The aim is to map words into a geometric space, called the embedding space.

If the embedding captures the relationships between words well, we can even do arithmetic with the word vectors, for example:

king - man + woman ≈ queen
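Here is a toy illustration of that arithmetic. The vectors below are made up for the example, not learned embeddings, but they show how the nearest neighbour of king - man + woman can be queen.

 import numpy as np

 # Made-up 3-dimensional "embeddings" purely for illustration; real embeddings
 # are learned from data and have hundreds of dimensions.
 emb = {
     'king':  np.array([0.9, 0.8, 0.1]),
     'man':   np.array([0.5, 0.1, 0.1]),
     'woman': np.array([0.5, 0.1, 0.9]),
     'queen': np.array([0.9, 0.8, 0.9]),
 }
 result = emb['king'] - emb['man'] + emb['woman']

 def cosine(a, b):
     return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

 # The closest word to king - man + woman turns out to be 'queen'.
 print(max(emb, key=lambda w: cosine(emb[w], result)))  # queen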

Keras provides a couple of methods for text preprocessing and sequence preprocessing. We can use them to make our data a better fit for the TextCNN model.

Let’s prepare word embeddings for the model.

Input:

 from keras.preprocessing.text import Tokenizer
 tokenizer = Tokenizer(num_words=5000)
 tokenizer.fit_on_texts(review_train)
 Xcnn_train = tokenizer.texts_to_sequences(review_train)
 Xcnn_test = tokenizer.texts_to_sequences(review_test)
 vocab_size = len(tokenizer.word_index) + 1  
 print(review_train[1])
 print(Xcnn_train[1]) 

Output:

Here we can see that the most common words get small indices in our vocabulary, while the extremely uncommon words get higher index values, because they still hold some information; words of moderate frequency get moderate index values. The vocabulary size is the total word count plus one, since the value 0 is reserved and won’t be assigned to any word.
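We can peek at this ordering directly (a quick sketch; the exact words depend on your training split). The tokenizer builds word_index sorted by frequency, most common first:

 items = list(tokenizer.word_index.items())
 print(items[:5])    # most frequent words receive the smallest indices, starting at 1
 print(items[-5:])   # the rarest words receive the largest indices
 print(vocab_size)   # len(word_index) + 1, since index 0 is reserved for padding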

One problem is that each sequence has a different length of words. To give every sequence the same length, we specify a maxlen parameter and use pad_sequences(), which simply pads the sequences of word indices with zeros.

Input:

 from keras.preprocessing.sequence import pad_sequences
 maxlen = 100
 Xcnn_train = pad_sequences(Xcnn_train, padding='post', maxlen=maxlen)
 Xcnn_test = pad_sequences(Xcnn_test, padding='post', maxlen=maxlen)
 print(Xcnn_train[0, :]) 

Output:

After padding, the sequences all have the same length, with zeros appended at the end, and now we can use them in a deep learning model. In the next step, we will build and fit the TextCNN model.

First of all, we need to import Sequential and layers.

Input:

 from keras.models import Sequential
 from keras import layers 

Making the model by stacking layers in it:

 embedding_dim = 200
 textcnnmodel = Sequential()
 textcnnmodel.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
 textcnnmodel.add(layers.Conv1D(128, 5, activation='relu'))
 textcnnmodel.add(layers.GlobalMaxPooling1D())
 textcnnmodel.add(layers.Dense(10, activation='relu'))
 textcnnmodel.add(layers.Dense(1, activation='sigmoid'))
 textcnnmodel.compile(optimizer='adam',
               loss='binary_crossentropy',
               metrics=['accuracy'])
 textcnnmodel.summary() 

Output:

 Model: "sequential_1"
 _________________________________________________________________
 Layer (type)                 Output Shape              Param #   
 =================================================================
 embedding_1 (Embedding)      (None, 100, 200)          920600    
 _________________________________________________________________
 conv1d_1 (Conv1D)            (None, 96, 128)           128128    
 _________________________________________________________________
 global_max_pooling1d_1 (Glob (None, 128)               0         
 _________________________________________________________________
 dense_2 (Dense)              (None, 10)                1290      
 _________________________________________________________________
 dense_3 (Dense)              (None, 1)                 11        
 =================================================================
 Total params: 1,050,029
 Trainable params: 1,050,029
 Non-trainable params: 0
 _________________________________________________________________ 

 Let’s fit the model and check for accuracy.

Input :

 textcnnmodel.fit(Xcnn_train, label_train,
                     epochs=10,
                     verbose=False,
                     validation_data=(Xcnn_test, label_test),
                     batch_size=10)
 loss, accuracy = textcnnmodel.evaluate(Xcnn_train, label_train, verbose=False)
 print("Training Accuracy: {:.4f}".format(accuracy))
 loss, accuracy = textcnnmodel.evaluate(Xcnn_test, label_test, verbose=False)
 print("Testing Accuracy:  {:.4f}".format(accuracy)) 

Output:

 Training Accuracy: 1.0000
 Testing Accuracy:  0.8040 

Here we can see that our model overfits the training data, but the test accuracy is decent. There are many ways to improve the model: in this article we have not cleaned the data, and a CNN requires a large amount of data to train well, so with more samples it might perform better; regularization can also help, as sketched below. Even with the overfitting, it gave quite good results.
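As one illustration of such an improvement (a sketch under assumptions, not part of the original experiment), we could add a Dropout layer and stop training when validation loss stops improving:

 from keras.callbacks import EarlyStopping

 # A sketch of one possible remedy (assumed layers, not the article's tuned model):
 regularized = Sequential()
 regularized.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
 regularized.add(layers.Conv1D(128, 5, activation='relu'))
 regularized.add(layers.GlobalMaxPooling1D())
 regularized.add(layers.Dropout(0.5))  # randomly drops units during training to curb overfitting
 regularized.add(layers.Dense(10, activation='relu'))
 regularized.add(layers.Dense(1, activation='sigmoid'))
 regularized.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

 # Stop when validation loss has not improved for three consecutive epochs.
 regularized.fit(Xcnn_train, label_train,
                 epochs=20,
                 validation_data=(Xcnn_test, label_test),
                 batch_size=10,
                 callbacks=[EarlyStopping(monitor='val_loss', patience=3)])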

There are many cases where we will need to handle large and complex datasets, for which a simple classification model like our logistic regression cannot be expected to perform well. For example, on complex datasets where building a vocabulary blows up the size of the count matrices, we can use the TextCNN model instead, because, as we have seen, it looks for relationships between words.
