Now Reading
Tutorial On Keras Tokenizer For Text Classification in NLP

Tutorial On Keras Tokenizer For Text Classification in NLP

keras tokenizer
W3Schools

Natural language processing has many different applications like Text Classification, Informal Retrieval, POS Tagging, etc. Almost all tasks in NLP, we need to deal with a large volume of texts. Since machines do not understand the text we need to transform it in a way that machine can interpret it. Therefore we convert texts in the form of vectors. There are many different methods to do this conversion like count vectorizer, TF-IDF vectorizer, and also Keras have tokenizers that serve the same purpose.

In this article, we will explore Keras tokenizer through which we will convert the texts into sequences that can be further fed to the predictive model. To do this we will make use of the Reuters data set that can be directly imported from the Keras library or can be downloaded from Kaggle. This data set contains  11,228 newswires from Reuters having 46 topics as labels. We will make use of different modes present in Keras tokenizer and will build deep neural networks for classification.

What we will learn from this article?



  • How to use Keras Tokenizer?
  • What are different modes in Keras Tokenizer? 
  • How to build classification models over the Reuters data set? 
  • Model Performance for Different Modes Of Tokenization

We will first import all the required libraries that are required and Reuters data from Keras library. Use the below code to the same. 

import keras 
from keras.datasets import reuters
from keras.models import Sequential 
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import tensorflow as tf
(X_train,y_train),(X_test,y_test) = reuters.load_data()

Now we will check about the shape of training and testing data. Use the below code to check the same.

Now we will check about the shape of training and testing data. Use the below code to check the same. 

print(X_train.shape)

print(X_test.shape)

Output:

Now we will first tokenize the corpus with keeping only 50000 words and then convert training and testing to the sequence of matrices using binary mode. We also need to convert the training and testing labels categorically to having a total of 46 classes. Use the below code to all the transformations. 

  1. Binary Mode For Converting Sequence To Matrix
tokenizer = Tokenizer(num_words=50000)
X_train = tokenizer.sequences_to_matrix(X_train, mode='binary')
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
y_train = keras.utils.to_categorical(y_train,num_classes=46)
y_test = keras.utils.to_categorical(y_test,num_classes=46)

Since we are done with all the required transformation we will now define the network for classification. Use the below code for defining the model network. Also, we can design different model networks of other architecture as well. 

model = Sequential()
model.add(tf.keras.layers.Dense(128,input_shape=(X_train[0].shape)))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))

Now we will see the model summary. Use the below code to check it.

print(model.summary())

Output:

Now we will compile the model using optimizer as stochastic gradient descent, loss as cross-entropy and metrics to measure the performance would be accuracy. After compiling we will train the model and check the performance on validation data. We are taking a batch size of 64 and epochs to be 10. Use the below code to the same. 

model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])

model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)

Output:

After the training, we will check the performance of the model built by binary mode. Use the below code for the same. We see this mode of the model has given us 79% accuracy. 

model.evaluate(X_test,y_test)

Output:

2. Count Mode For Converting Sequence To Matrix 

Now we will build the same model with the count mode of the tokenizer.  Use the below code for the same.

(X_train,y_train),(X_test,y_test) = reuters.load_data()
 
X_train = tokenizer.sequences_to_matrix(X_train, mode='count')
X_test = tokenizer.sequences_to_matrix(X_test, mode='count')
 
y_train = keras.utils.to_categorical(y_train,num_classes=46)
y_test = keras.utils.to_categorical(y_test,num_classes=46)
 
model = Sequential()
model.add(tf.keras.layers.Dense(128,input_shape=(X_train[0].shape)))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))

We will compile this model keeping all parameters the same. Use the below code to compile it, train the network and compute the performance. 

model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])

model.evaluate(X_test,y_test)

3. Frequency Mode For Converting Sequence To Matrix

We now build the same network with a freq mode of tokenizer keeping every other parameter to be the same.

See Also

(X_train,y_train),(X_test,y_test) = reuters.load_data()
 
X_train = tokenizer.sequences_to_matrix(X_train, mode='freq')
X_test = tokenizer.sequences_to_matrix(X_test, mode='freq')
 
y_train = keras.utils.to_categorical(y_train,num_classes=46)
y_test = keras.utils.to_categorical(y_test,num_classes=46)
 
model = Sequential()
model.add(tf.keras.layers.Dense(128,input_shape=(X_train[0].shape)))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))

Use the below code to compile it, train the network and compute the performance. 

model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])

model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)

Keras Tokenizer

model.evaluate(X_test,y_test)

Keras Tokenizer

4. TF-TDF Mode For Converting Sequence To Matrix 

We now build the same network with the last mode that is the TF-IDF mode of tokenizer keeping every other parameter to be the same. 

Use the below code to compile it, train the network and compute the performance. 

model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])

model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)

Keras Tokenizer

model.evaluate(X_test,y_test)

Keras Tokenizer


Model Performance for Different Modes Of Tokenization

Mode Accuracy on Validation Data.
Binary79%
Frequency78%
Count 54.8%
Tf-IDF80.4%

Conclusion 

I would like to conclude the article by hoping that you now have understood the four different modes that are there in Keras tokenizers for converting sequence to the matrix. We build the classification model over Reuters data using different four different modes that were binary, frequency, count, and TF-IDF. All four modes have different functionalities. We can also fine-tune the build models using different hyperparameter tuning techniques.

What Do You Think?

If you loved this story, do join our Telegram Community.


Also, you can write for us and be one of the 500+ experts who have contributed stories at AIM. Share your nominations here.
What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top