Tutorial On Keras Tokenizer For Text Classification in NLP

In this article, we will explore Keras tokenizer through which we will convert the texts into sequences that can be further fed to the predictive model.

Natural language processing has many applications, such as text classification, information retrieval and POS tagging. Almost all NLP tasks require us to deal with a large volume of text. Since machines do not understand raw text, we need to transform it into a form they can interpret, so we convert texts into vectors. There are many methods for this conversion, such as the count vectorizer and the TF-IDF vectorizer, and Keras also provides a tokenizer that serves the same purpose.

In this article, we will explore the Keras tokenizer, through which we will convert texts into sequences that can then be fed to a predictive model. To do this we will use the Reuters data set, which can be imported directly from the Keras library or downloaded from Kaggle. This data set contains 11,228 newswires from Reuters, labelled with 46 topics. We will use the different modes available in the Keras tokenizer and build deep neural networks for classification.

What will we learn from this article?

  • How to use Keras Tokenizer?
  • What are different modes in Keras Tokenizer? 
  • How to build classification models over the Reuters data set? 
  • Model Performance for Different Modes Of Tokenization

We will first import the required libraries and load the Reuters data from the Keras library. Use the code below to do the same.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import reuters
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer

# Load the newswires as lists of word indices, with integer topic labels
(X_train,y_train),(X_test,y_test) = reuters.load_data()

Now we will check the shape of the training and testing data. Use the code below to check it.

print(X_train.shape)

print(X_test.shape)

Output:

(8982,)
(2246,)

1. Binary Mode For Converting Sequence To Matrix

We first create a tokenizer that keeps at most 50,000 words and use it to convert the training and testing sequences into matrices in binary mode, where each row is a fixed-length vector marking which words occur in that newswire. We also one-hot encode the training and testing labels into 46 classes. Use the code below to apply these transformations.

tokenizer = Tokenizer(num_words=50000)
X_train = tokenizer.sequences_to_matrix(X_train, mode='binary')
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
y_train = keras.utils.to_categorical(y_train,num_classes=46)
y_test = keras.utils.to_categorical(y_test,num_classes=46)
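
To make the difference between the modes concrete, here is a minimal pure-Python sketch of what binary, count and freq encoding do to one short sequence. The helper sequence_to_vector and the toy sequence are hypothetical illustrations, not part of Keras. The sketch mirrors our understanding of sequences_to_matrix rather than its actual implementation.

```python
from collections import Counter


def sequence_to_vector(seq, num_words, mode="binary"):
    """Hypothetical helper: encode one list of word indices as a vector of length num_words."""
    vec = [0.0] * num_words
    # Count each word index, ignoring indices outside the vocabulary
    counts = Counter(idx for idx in seq if idx < num_words)
    for idx, count in counts.items():
        if mode == "binary":
            vec[idx] = 1.0               # the word is present
        elif mode == "count":
            vec[idx] = float(count)      # how many times the word occurs
        elif mode == "freq":
            vec[idx] = count / len(seq)  # occurrences divided by sequence length
        else:
            raise ValueError("unknown mode: " + mode)
    return vec


toy_sequence = [1, 3, 3, 4]  # word 3 occurs twice
binary_vec = sequence_to_vector(toy_sequence, 6, "binary")
count_vec = sequence_to_vector(toy_sequence, 6, "count")
freq_vec = sequence_to_vector(toy_sequence, 6, "freq")

print(binary_vec)  # [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
print(count_vec)   # [0.0, 1.0, 0.0, 2.0, 1.0, 0.0]
print(freq_vec)    # [0.0, 0.25, 0.0, 0.5, 0.25, 0.0]
```

Every sequence becomes a vector of the same length regardless of how long the newswire is, which is what lets us feed the result to a dense network.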

Now that all the required transformations are done, we will define the network for classification. Use the code below to define the model. Other architectures could be used here as well.

model = Sequential()
model.add(tf.keras.layers.Dense(128,input_shape=(X_train[0].shape)))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))

Now we will see the model summary. Use the below code to check it.

model.summary()
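
As a rough cross-check of the summary, the parameter count of this network can be worked out by hand. The sketch below assumes the input width is 50,000 (following from num_words=50000) and that batch normalization keeps four values per feature, two of them trainable. The helper dense_params is hypothetical and only illustrates the arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Input width followed by the units of each Dense layer in the model above
layer_sizes = [50000, 128, 512, 128, 512, 46]

dense_total = sum(
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)

# Batch normalization after the first 512-unit layer:
# gamma, beta, moving mean and moving variance for each of the 512 features
batch_norm_params = 4 * 512

total_params = dense_total + batch_norm_params
print(dense_params(50000, 128))  # 6400128
print(total_params)              # 6623534
```

Almost all of the parameters sit in the first layer, because every one of the 50,000 vocabulary positions is connected to each of its 128 units.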

Now we will compile the model with stochastic gradient descent as the optimizer, categorical cross-entropy as the loss and accuracy as the performance metric. After compiling, we will train the model and check its performance on the validation data, using a batch size of 64 for 10 epochs. Use the code below to do the same.

model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])

model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)
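
To see what the loss measures, here is a small pure-Python sketch of one-hot encoding and categorical cross-entropy for a single sample. The helpers one_hot and categorical_crossentropy and the toy prediction are hypothetical, and the clipping constant is an assumption. The last lines estimate the number of batches per epoch, assuming the default split of 8,982 training newswires.

```python
import math


def one_hot(label, num_classes):
    """Turn an integer class label into a one-hot vector."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class (eps avoids log(0))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = one_hot(2, 4)          # true class is 2 out of 4 classes
y_pred = [0.1, 0.2, 0.6, 0.1]   # toy softmax output
loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))  # 0.5108

# Batches per epoch with batch_size=64, assuming 8,982 training samples
steps_per_epoch = math.ceil(8982 / 64)
print(steps_per_epoch)  # 141
```

The loss shrinks towards zero as the probability the model assigns to the correct topic approaches one.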

After training, we will check the performance of the model built with binary mode. Use the code below for the same. This model gives us about 79% accuracy.

model.evaluate(X_test,y_test)

2. Count Mode For Converting Sequence To Matrix 

Now we will build the same model with the count mode of the tokenizer.  Use the below code for the same.

(X_train,y_train),(X_test,y_test) = reuters.load_data()
 
X_train = tokenizer.sequences_to_matrix(X_train, mode='count')
X_test = tokenizer.sequences_to_matrix(X_test, mode='count')
 
y_train = keras.utils.to_categorical(y_train,num_classes=46)
y_test = keras.utils.to_categorical(y_test,num_classes=46)
 
model = Sequential()
model.add(tf.keras.layers.Dense(128,input_shape=(X_train[0].shape)))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))

We will compile this model keeping all parameters the same. Use the below code to compile it, train the network and compute the performance. 

model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])

model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)

model.evaluate(X_test,y_test)

3. Frequency Mode For Converting Sequence To Matrix

We now build the same network with the freq mode of the tokenizer, keeping every other parameter the same.

(X_train,y_train),(X_test,y_test) = reuters.load_data()
 
X_train = tokenizer.sequences_to_matrix(X_train, mode='freq')
X_test = tokenizer.sequences_to_matrix(X_test, mode='freq')
 
y_train = keras.utils.to_categorical(y_train,num_classes=46)
y_test = keras.utils.to_categorical(y_test,num_classes=46)
 
model = Sequential()
model.add(tf.keras.layers.Dense(128,input_shape=(X_train[0].shape)))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))

Use the below code to compile it, train the network and compute the performance. 

model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])

model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)

model.evaluate(X_test,y_test)

4. TF-IDF Mode For Converting Sequence To Matrix

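We now build the same network with the last mode, the TF-IDF mode of the tokenizer, keeping every other parameter the same. As in the previous sections, reload the data, convert the sequences with mode='tfidf', one-hot encode the labels and rebuild the model before compiling. Note that TF-IDF mode needs document counts, so the tokenizer must first be fitted on the training sequences, for example with tokenizer.fit_on_sequences(X_train).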
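
The weighting behind this mode can be sketched in pure Python. The helpers tfidf_weight and tfidf_vectors and the toy documents are hypothetical. The log-scaled term frequency and the smoothed inverse document frequency follow our understanding of how the Keras tokenizer weights words, so treat the exact formula as an assumption.

```python
import math


def tfidf_weight(count_in_doc, docs_with_word, total_docs):
    """Log-scaled term frequency times smoothed inverse document frequency."""
    tf = 1 + math.log(count_in_doc)
    idf = math.log(1 + total_docs / (1 + docs_with_word))
    return tf * idf


def tfidf_vectors(sequences, num_words):
    """Encode each sequence of word indices as a TF-IDF vector of length num_words."""
    total_docs = len(sequences)
    # Number of documents in which each word index appears (the "fitting" step)
    doc_freq = {}
    for seq in sequences:
        for idx in set(seq):
            doc_freq[idx] = doc_freq.get(idx, 0) + 1
    matrix = []
    for seq in sequences:
        vec = [0.0] * num_words
        for idx in set(seq):
            if idx < num_words:
                vec[idx] = tfidf_weight(seq.count(idx), doc_freq[idx], total_docs)
        matrix.append(vec)
    return matrix


toy_docs = [[1, 2, 2], [2, 3]]  # word 2 appears in both documents
tfidf_matrix = tfidf_vectors(toy_docs, 4)
for row in tfidf_matrix:
    print([round(value, 4) for value in row])
```

In the second toy document, word 3 receives a larger weight than word 2 because it appears in fewer documents. This down-weighting of common words is what separates this mode from plain counts.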

Use the below code to compile it, train the network and compute the performance. 

model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])

model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)

model.evaluate(X_test,y_test)

Model Performance for Different Modes Of Tokenization

Mode         Accuracy on Validation Data
Binary       79%
Frequency    78%
Count        54.8%
TF-IDF       80.4%

Conclusion 

I would like to conclude the article by hoping that you now understand the four modes available in the Keras tokenizer for converting sequences to matrices. We built classification models over the Reuters data using four different modes: binary, frequency, count and TF-IDF. Each mode encodes word occurrences differently. The models we built can also be fine-tuned using different hyperparameter tuning techniques.

Rohit Dwivedi

I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. I am a data science enthusiast who likes to draw insights from data, and I am always amazed by the intelligence of AI. It is really fascinating to teach a machine to see and understand images, and the interest doubles when the machine can tell you what it just saw. This is why I am highly interested in Computer Vision and Natural Language Processing. I love exploring different use cases that can be built with the power of AI. I am the person who first develops something and then explains it to the whole community through my writing.