Natural Language Processing (NLP) is the branch of artificial intelligence concerned with tasks performed on natural languages, where the word ‘natural’ refers to the languages that evolved among humans for communication. A long-standing goal in artificial intelligence is to make machines communicate effectively with humans. Language modeling and language generation (such as neural machine translation) have been popular among researchers for over a decade. For a beginner, a practical entry point into NLP is text classification. Sentiment analysis is a text classification application in which a given text is classified into a positive class or a negative class (and sometimes a neutral class) based on its context. This article discusses sentiment analysis using TensorFlow Keras with the IMDB movie reviews dataset, one of the well-known sentiment analysis datasets.
TensorFlow’s Keras API offers the complete functionality required to build and execute a deep learning model. This article assumes that the reader is familiar with the basics of deep learning and Recurrent Neural Networks (RNNs). Nevertheless, the following articles may provide a good grounding in deep learning and RNNs:
- Getting Started With Deep Learning Using TensorFlow Keras
- Implementing A Recurrent Neural Network (RNN) From Scratch
Create the Environment
Create the necessary Python environment by importing the frameworks and libraries.
# for array operations
import numpy as np
# deep learning framework
import tensorflow as tf
# to obtain the IMDB dataset
import tensorflow_datasets as tfds
# Keras API
from tensorflow import keras
# required layers
from tensorflow.keras.layers import Dense, Dropout, Bidirectional, LSTM
# to visualize the performance
import matplotlib.pyplot as plt
Download the IMDB dataset
The IMDB reviews dataset is available in TensorFlow Datasets in several variants:
- Plain text reviews
- Byte-encoded texts
- Integer-encoded texts with around 8k vocabulary
- Integer-encoded texts with around 32k vocabulary
Here, we use the dataset that has integer-encoded texts with around 8k vocabulary words.
data, meta = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
What data are downloaded?
We do not require the unsupervised data. Hence, we extract only the train and test sets.
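As a quick check (a minimal sketch using the data dictionary returned by tfds.load), the downloaded splits can be listed; the imdb_reviews build also includes an ‘unsupervised’ split:

# list the splits returned by tfds.load
print(list(data.keys()))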
train = data['train']
test = data['test']
train, test
It can be observed that the texts are sequences of integers and the labels are integers as well. Moreover, the texts are of variable length, since no fixed size is specified. The data is already preprocessed and encoded, and is ready to use.
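To confirm this, the element specification of the train split can be printed (a minimal sketch using the train split extracted above):

# texts are variable-length int64 sequences, labels are scalar int64
print(train.element_spec)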
Prepare an Encoder
We have discussed that the dataset comes with the texts already encoded into integers. Encoding into integers is necessary because machines operate on numbers, but humans cannot read those integer sequences. Hence, we need a decoder that reverses the encoding, converting the numbers back into readable English text. Likewise, we need an encoder that can convert an example text (from outside the dataset) into integers.
The metadata that comes with the dataset contains the encoder originally used while preparing the dataset. It can perform both encoding and decoding operations.
It can be observed that metadata contains the encoder under the key ‘text’.
# extract the encoder
encoder = meta.features['text'].encoder
The encoded integers range from 1 to the vocabulary size. How many vocabulary words does the encoder have?
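One way to check this, assuming the encoder extracted above:

# size of the subword vocabulary (around 8k)
print(encoder.vocab_size)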
What are the original text tokens? A portion of the encoder’s subword vocabulary can be inspected as shown below.
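A minimal sketch, again assuming the extracted encoder:

# list the first few subword tokens of the vocabulary
print(encoder.subwords[:20])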
Test the encoder by sampling a sentence, encoding it into integers, and decoding back into text.
example = 'Analytics India Magazine !'
enc = encoder.encode(example)
enc
We have provided a sentence with three words and one exclamation mark, but it is encoded into an eleven-element integer list. The split pieces are technically called (subword) tokens. Let’s explore the numbers and their corresponding tokens using the decode method.
for integer in enc:
    text = encoder.decode([integer])
    print('%4d : %s' % (integer, text))
Preprocess the Dataset
The input texts are of variable lengths, but a deep learning model cannot accept inputs of different sizes within a batch. We have to fix the length of each input sequence. If a sequence has fewer tokens than the fixed length, it is padded with zeros. This is accomplished with the padded_batch method, which pads the sequences in a batch so that they all have the same length. Since the large vocabulary size would make direct manipulation unwieldy, each token is embedded into a small dense vector representation. We perform this with an Embedding layer.
BUFFER_SIZE = 10000
BATCH_SIZE = 64
AUTOTUNE = tf.data.AUTOTUNE

train_data = train.shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([None], []))
train_data = train_data.prefetch(AUTOTUNE)
test_data = test.padded_batch(BATCH_SIZE, padded_shapes=([None], []))

embed_layer = keras.layers.Embedding(encoder.vocab_size, 64)
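As a quick sanity check (a sketch reusing the pipeline defined above), one batch can be drawn to confirm that every text in it has been padded to a common length:

# take one padded batch and inspect its shape
for texts, labels in train_data.take(1):
    print(texts.shape, labels.shape)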
Build the Model
Unlike images and structured data, texts have a sequential order of tokens that contributes to the context. Hence, the deep learning model should be able to remember past tokens while processing the current one. This is achieved by implementing either Recurrent Neural Networks or Transformers. Here, we model our problem with a Recurrent Neural Network built from LSTM units. LSTM (Long Short-Term Memory) units keep the relevant past portion of the embedded sequence in memory and thereby model the sequential relationships in the text. LSTM layers can be wrapped in bidirectional layers so that the model can understand the context of a sentence in both directions, namely left-to-right and right-to-left.
model = keras.Sequential([
    # embedding layer
    embed_layer,
    # bidirectional LSTM layers
    Bidirectional(LSTM(64, dropout=0.5, recurrent_dropout=0.5, return_sequences=True)),
    Bidirectional(LSTM(32, dropout=0.5, recurrent_dropout=0.5, return_sequences=True)),
    Bidirectional(LSTM(16, dropout=0.5, recurrent_dropout=0.5)),
    # classification head
    Dense(64, activation='relu', kernel_regularizer='l2'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
We have used dropout layers and a kernel regularizer to limit overfitting. In the LSTM layers, dropout is applied in two places: once to the input data (dropout) and once to the recurrent state (recurrent_dropout).
How many parameters does the model have?
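One way to inspect this is model.summary(); a minimal sketch, noting that the Sequential model is built lazily and therefore needs an input shape first:

# build with (batch, timesteps) unspecified, then print parameter counts
model.build(input_shape=(None, None))
model.summary()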
Plotting the model gives a better understanding of data flow through layers.
keras.utils.plot_model(model, show_shapes=True, dpi=48)
Train the Model
Compile the built model with the Adam optimizer, the accuracy metric, and the binary cross-entropy loss function.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Train the model for 2 epochs. Note that training may take longer than for multi-layer perceptrons (MLPs) and CNNs because of the sequential processing in the LSTM layers.
history = model.fit(train_data, validation_data=test_data, epochs=2)
Training for two epochs has taken more than 7 hours on a CPU runtime in a virtual machine with 12GB RAM. A GPU or TPU runtime offers only a limited speed-up for this particular model, because setting recurrent_dropout prevents Keras from dispatching the LSTM layers to the fused cuDNN kernel, and recurrent layers parallelise less readily than convolutions. Long training times like this are one of the reasons people opt for pre-trained models, such as BERT, for deployment.
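If a GPU runtime is available, one illustrative option (a variant sketched here under that assumption, not the configuration used above) is to drop recurrent_dropout so that Keras can use the fused cuDNN LSTM kernel:

# illustrative GPU-friendly variant: no recurrent_dropout,
# so the fused cuDNN LSTM kernel can be used on a GPU runtime
fast_model = keras.Sequential([
    keras.layers.Embedding(encoder.vocab_size, 64),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])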
Model Performance Evaluation
The model has been trained and is ready to make inferences. Plot the training losses to have a better understanding of its performance.
hist = history.history
plt.plot(hist['loss'])
plt.plot(hist['val_loss'])
plt.legend(labels=['Training', 'Validation'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()
The training loss decreases over the two epochs, and training for more epochs may reduce the losses further and let the model learn the patterns better.
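The accuracy curves can be plotted in the same way (a sketch reusing the hist dictionary defined above):

# plot training and validation accuracy per epoch
plt.plot(hist['accuracy'])
plt.plot(hist['val_accuracy'])
plt.legend(labels=['Training', 'Validation'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.show()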
Model Inference – Sentiment Analysis
Sample predictions on three synthetic reviews:
# sample reviews
samples = ['The plot is fantastic',
           'The movie was cool and thrilling',
           'one of the worst films I have ever seen']

# encode into integers
sample_encoded = [encoder.encode(sample) for sample in samples]

# pad with zeros to have the same length
sample_padded = []
for s in sample_encoded:
    pad_length = 128 - len(s)
    zeros = [0] * pad_length
    s.extend(zeros)
    s = tf.convert_to_tensor(s)
    sample_padded.append(s)

# convert into a tensor before feeding the model
sample_padded = tf.convert_to_tensor(sample_padded)

# make predictions
predictions = model.predict(sample_padded)
predictions
A prediction above 0.5 indicates a positive review, and a prediction below 0.5 indicates a negative review.
print('Predictions on sample test reviews... \n')
for i in range(len(samples)):
    pred = predictions[i]
    sentiment = 'positive' if pred > 0.5 else 'negative'
    print('%40s : %s' % (samples[i], sentiment))
This Notebook carries the above code implementation.
In this article, we have discussed sentiment analysis with text data. We have walked through a hands-on TensorFlow implementation of sentiment analysis on the large IMDB movie review dataset. We processed the data by padding and embedding it, built an RNN model with bidirectional LSTM layers, and trained the model. Finally, we evaluated the model by predicting the sentiment of some sample movie reviews.