Natural language processing has many different applications, such as text classification, information retrieval, POS tagging, etc. In almost all NLP tasks we need to deal with a large volume of text. Since machines do not understand raw text, we need to transform it into a form a machine can interpret, so we convert texts into vectors. There are many different methods for this conversion, such as the count vectorizer and the TF-IDF vectorizer, and Keras also provides a tokenizer that serves the same purpose.
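To see what the tokenizer does before we work with the full data set, here is a minimal sketch on a made-up two-sentence corpus (the sentences and variable names are purely illustrative), printing the first document's vector under each of the four modes we will use later:

from keras.preprocessing.text import Tokenizer

# Toy corpus purely for illustration
texts = ['the cat sat on the mat', 'the dog sat']

toy_tokenizer = Tokenizer(num_words=10)
toy_tokenizer.fit_on_texts(texts)

# Each mode turns the texts into a (num_documents, num_words) matrix
for mode in ['binary', 'count', 'freq', 'tfidf']:
    print(mode, toy_tokenizer.texts_to_matrix(texts, mode=mode)[0])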
In this article, we will explore the Keras tokenizer, through which we will convert the texts into sequences that can then be fed to a predictive model. To do this, we will make use of the Reuters data set, which can be imported directly from the Keras library or downloaded from Kaggle. This data set contains 11,228 newswires from Reuters, labelled with 46 topics. We will make use of the different modes present in the Keras tokenizer and build deep neural networks for classification.
What will we learn from this article?
- How to use Keras Tokenizer?
- What are different modes in Keras Tokenizer?
- How to build classification models over the Reuters data set?
- Model Performance for Different Modes Of Tokenization
We will first import all the required libraries and the Reuters data from the Keras library. Use the below code for the same.
import keras
from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
import tensorflow as tf

(X_train, y_train), (X_test, y_test) = reuters.load_data()
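Note that reuters.load_data() returns newswires that are already encoded as sequences of word indices rather than raw strings, which is why we can pass them straight to sequences_to_matrix later. A quick optional inspection, just a sketch:

# Each sample is a list of word indices; each label is an integer topic id
print(X_train[0][:10])
print(y_train[0])

# The index-to-word mapping can be recovered if needed
word_index = reuters.get_word_index()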
Now we will check the shape of the training and testing data. Use the below code to check the same.
print(X_train.shape)
print(X_test.shape)
Output:
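With the default test_split of 0.2, this should print (8982,) and (2246,), i.e. 8,982 newswires for training and 2,246 for testing, which together make up the 11,228 newswires mentioned above.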
Now we will first tokenize the corpus, keeping only the top 50,000 words, and then convert the training and testing sequences to matrices using binary mode. We also need to convert the training and testing labels to categorical form, with a total of 46 classes. Use the below code for all the transformations.
1. Binary Mode For Converting Sequence To Matrix
tokenizer = Tokenizer(num_words=50000)
X_train = tokenizer.sequences_to_matrix(X_train, mode='binary')
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
y_train = keras.utils.to_categorical(y_train, num_classes=46)
y_test = keras.utils.to_categorical(y_test, num_classes=46)
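In binary mode each newswire becomes a 50,000-dimensional multi-hot vector: entry j is 1 if word index j occurs in the document and 0 otherwise, regardless of how often it occurs. A quick sanity check, assuming the transformations above have run:

import numpy as np

print(X_train.shape)          # (8982, 50000)
print(np.unique(X_train[0]))  # [0. 1.] -- only presence or absence is recorded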
Since we are done with all the required transformations, we will now define the network for classification. Use the below code to define the model network. We could also design model networks with other architectures.
model = Sequential()
model.add(tf.keras.layers.Dense(128, input_shape=X_train[0].shape))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))
Now we will see the model summary. Use the below code to check it.
print(model.summary())
Output:
Now we will compile the model using stochastic gradient descent as the optimizer, categorical cross-entropy as the loss, and accuracy as the metric to measure performance. After compiling, we will train the model and check its performance on the validation data. We take a batch size of 64 and 10 epochs. Use the below code for the same.
model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)
Output:
After training, we will check the performance of the model built with binary mode. Use the below code for the same. We see that this mode has given us 79% accuracy.
model.evaluate(X_test,y_test)
Output:
2. Count Mode For Converting Sequence To Matrix
Now we will build the same model with the count mode of the tokenizer. Use the below code for the same.
(X_train, y_train), (X_test, y_test) = reuters.load_data()
# In count mode each entry holds the raw number of times a word occurs in the document
X_train = tokenizer.sequences_to_matrix(X_train, mode='count')
X_test = tokenizer.sequences_to_matrix(X_test, mode='count')
y_train = keras.utils.to_categorical(y_train, num_classes=46)
y_test = keras.utils.to_categorical(y_test, num_classes=46)
model = Sequential()
model.add(tf.keras.layers.Dense(128, input_shape=X_train[0].shape))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))
We will compile this model keeping all parameters the same. Use the below code to compile it, train the network and compute the performance.
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=64, epochs=10)
model.evaluate(X_test, y_test)
3. Frequency Mode For Converting Sequence To Matrix
We now build the same network with the freq mode of the tokenizer, keeping every other parameter the same.
(X_train, y_train), (X_test, y_test) = reuters.load_data()
# In freq mode the word counts are normalized by the length of each sequence
X_train = tokenizer.sequences_to_matrix(X_train, mode='freq')
X_test = tokenizer.sequences_to_matrix(X_test, mode='freq')
y_train = keras.utils.to_categorical(y_train, num_classes=46)
y_test = keras.utils.to_categorical(y_test, num_classes=46)
model = Sequential()
model.add(tf.keras.layers.Dense(128, input_shape=X_train[0].shape))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))
Use the below code to compile it, train the network and compute the performance.
model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)
model.evaluate(X_test,y_test)
4. TF-IDF Mode For Converting Sequence To Matrix
We now build the same network with the last mode, the TF-IDF mode of the tokenizer, keeping every other parameter the same, as shown in the sketch below.
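The transformation and the model definition mirror the earlier modes, with one caveat: in the Keras implementation, sequences_to_matrix raises an error in tfidf mode unless the tokenizer has been fit first, so we call fit_on_sequences on the training data before transforming:

(X_train, y_train), (X_test, y_test) = reuters.load_data()
# tfidf mode needs document statistics, so fit the tokenizer on the sequences first
tokenizer.fit_on_sequences(X_train)
X_train = tokenizer.sequences_to_matrix(X_train, mode='tfidf')
X_test = tokenizer.sequences_to_matrix(X_test, mode='tfidf')
y_train = keras.utils.to_categorical(y_train, num_classes=46)
y_test = keras.utils.to_categorical(y_test, num_classes=46)
model = Sequential()
model.add(tf.keras.layers.Dense(128, input_shape=X_train[0].shape))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(tf.keras.layers.Dense(46, activation='softmax'))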
Use the below code to compile it, train the network and compute the performance.
model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=64,epochs=10)
model.evaluate(X_test,y_test)
Model Performance for Different Modes Of Tokenization
| Mode | Accuracy on Validation Data |
| --- | --- |
| Binary | 79% |
| Frequency | 78% |
| Count | 54.8% |
| TF-IDF | 80.4% |
Conclusion
I would like to conclude the article by hoping that you have now understood the four different modes available in the Keras tokenizer for converting sequences to matrices. We built classification models over the Reuters data using four different modes: binary, count, frequency, and TF-IDF. All four modes encode the text differently. We can also fine-tune the models we built using different hyperparameter tuning techniques.