In the modern age of data science, neural networks are gaining ground quickly because they can perform many tasks fast and with little manual effort. There are various kinds of neural networks, each suited to different tasks. In this article, we focus on the LSTM model, one variant of the neural network. In one of our previous articles, we discussed that LSTM networks perform well on sequential data such as time series. Here, we treat text as sequential data and fit an LSTM model to it. The major points to be discussed in this article are given below.
Table of Contents
- Introduction to LSTM
- The Architecture of LSTM
- Forget gate
- Input gate
- Cell state
- Output gate
- Why do we use LSTM with text data?
- Text classification using LSTM
Introduction to LSTM
LSTM (Long Short-Term Memory) is a type of RNN (Recurrent Neural Network) that is widely used for sequence prediction problems. Like every other neural network, an LSTM has layers that help it learn and recognize patterns in the data. Its basic operation is to retain the information that is needed and discard the information that is not useful for further prediction.
There are various types of LSTM networks, but we can divide them roughly into three types.
- Forward pass LSTM
- Backward pass LSTM
- Bidirectional LSTM
As the names suggest, the forward pass and backward pass LSTMs are unidirectional: they process the sequence in one direction only, either front to back or back to front. The bidirectional LSTM processes the data in both directions so that information from both sides is preserved. All of these types are built on the same basic structure, and modifications to that structure are what distinguish one LSTM variant from another. Next, we will look at the components of a basic LSTM architecture.
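The difference between unidirectional and bidirectional processing can be sketched in a few lines. This is only an illustration: `toy_step` is a hypothetical stand-in for a real LSTM cell, and the function names are assumptions made for this example.

```python
def run_direction(sequence, step, initial_state=0.0):
    """Feed the sequence to `step` one item at a time and collect the states."""
    state = initial_state
    states = []
    for item in sequence:
        state = step(state, item)
        states.append(state)
    return states


def toy_step(state, item):
    # Hypothetical stand-in for an LSTM cell: a decaying running sum.
    return 0.5 * state + item


def bidirectional(sequence, step):
    forward = run_direction(sequence, step)
    # Process the reversed sequence, then flip the states back so that
    # position t lines up with the forward state at position t.
    backward = run_direction(sequence[::-1], step)[::-1]
    return list(zip(forward, backward))


states = bidirectional([1.0, 2.0, 3.0], toy_step)
print(states)
```

Each position ends up with two states: one that has seen everything before it and one that has seen everything after it.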
The Architecture of LSTM
A simple LSTM network consists of the following components.
- Forget gate
- Input gate
- Cell state
- Output gate
As hidden layers and gates are added to the simple LSTM, its type changes. For example, a bidirectional LSTM (BiLSTM) network consists of two LSTMs that process the sequence in opposite directions.
Let’s have an overview of the gates and the state.
Forget gate
As discussed earlier, one of the main properties of the LSTM is to retain the information that matters and to discard the information that the network does not need for learning and prediction. The forget gate is responsible for this behaviour.
It decides whether information can pass through the layers of the network. It takes two inputs: the hidden state from the previous time step and the input at the current time step.
The above image shows the forget gate circuit, where h is the previous hidden state and x is the current input. Both go through a sigmoid function, which produces values between 0 and 1. Information whose value is close to zero is eliminated from the network.
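As a minimal sketch of the forget gate, assuming scalar values and made-up weights (`w_h`, `w_x` and `b` are hypothetical names, not taken from the article), the gate is a sigmoid applied to a weighted sum of h and x:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(h_prev, x, w_h, w_x, b):
    """Scalar forget gate: f = sigmoid(w_h * h_prev + w_x * x + b)."""
    return sigmoid(w_h * h_prev + w_x * x + b)


# Hypothetical weights that push the gate towards "keep" and towards "drop".
f_keep = forget_gate(h_prev=0.5, x=1.0, w_h=2.0, w_x=3.0, b=1.0)     # z = 5
f_drop = forget_gate(h_prev=0.5, x=1.0, w_h=-2.0, w_x=-3.0, b=-1.0)  # z = -5

c_prev = 0.8  # some value stored in the cell state
print(f_keep * c_prev)  # almost unchanged
print(f_drop * c_prev)  # almost erased
```

A gate value near 1 lets the stored value through, while a value near 0 removes it.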
Input gate
The input gate decides how important new information is and updates the cell state accordingly. Where the forget gate removes information from the network, the input gate measures the importance of the incoming information, so that unimportant information is left out and the information needed for making predictions is learned.
The information goes through a sigmoid and a tanh function: the sigmoid decides how much weight each piece of information gets, and the tanh squashes the candidate values into the range -1 to 1, which keeps the values in the network regulated.
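The two functions of the input gate can be sketched as follows, again with scalars and hypothetical weights chosen only for illustration. The sigmoid output acts as a weight between 0 and 1, and the tanh output is the proposed new content between -1 and 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gate(h_prev, x, w_h, w_x, b):
    """i = sigmoid(...): a weight in (0, 1) for the new information."""
    return sigmoid(w_h * h_prev + w_x * x + b)


def candidate_values(h_prev, x, w_h, w_x, b):
    """g = tanh(...): the proposed new content, squashed into (-1, 1)."""
    return math.tanh(w_h * h_prev + w_x * x + b)


h_prev, x = 0.2, 1.5
i = input_gate(h_prev, x, w_h=1.0, w_x=2.0, b=0.0)
g = candidate_values(h_prev, x, w_h=-1.0, w_x=-2.0, b=0.0)
write = i * g  # what the input gate contributes to the cell state
print(i, g, write)
```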
Cell state
The weighted information then reaches the cell state, where the new cell state is calculated. First, the previous cell state is multiplied element-wise by the output of the forget gate, so information that should be dropped is multiplied by values near zero.
Then the output of the input gate is added to this result, which updates the cell state with the information that is relevant to the network.
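In the standard LSTM formulation, the cell state update is one multiplication followed by one addition. The sketch below uses scalars; `f`, `i` and `g` stand for the forget gate output, the input gate output and the tanh candidate value, and the function name is an assumption for this example.

```python
def update_cell_state(c_prev, f, i, g):
    """New cell state: keep a fraction f of the old state, add i * g."""
    return f * c_prev + i * g


# Forget gate fully open, input gate closed: the old state is kept as-is.
kept = update_cell_state(c_prev=0.8, f=1.0, i=0.0, g=0.5)
# Forget gate closed, input gate fully open: the old state is replaced.
replaced = update_cell_state(c_prev=0.8, f=0.0, i=1.0, g=0.5)
# In practice both gates sit somewhere in between.
mixed = update_cell_state(c_prev=0.8, f=0.5, i=0.5, g=0.5)
print(kept, replaced, mixed)
```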
Output gate
The output gate is the last gate of the circuit and decides the next hidden state of the network. The previous hidden state and the current input go through a sigmoid function. The updated cell state goes through a tanh function, and the result is multiplied by the sigmoid output. This product becomes the new hidden state.
This final stage of the circuit lets the hidden state decide which information it should carry forward.
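Putting the gates together, one full LSTM step can be sketched as below. This is a scalar toy version of the standard LSTM equations, not the code Keras runs internally. The `params` layout and all weight values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. `params` maps a gate name to (w_h, w_x, b)."""
    def pre_activation(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x + b

    f = sigmoid(pre_activation("forget"))        # how much of c_prev to keep
    i = sigmoid(pre_activation("input"))         # how much new content to write
    g = math.tanh(pre_activation("candidate"))   # the new content itself
    o = sigmoid(pre_activation("output"))        # how much of the cell to expose

    c = f * c_prev + i * g                       # updated cell state
    h = o * math.tanh(c)                         # next hidden state
    return h, c


# Hypothetical weights, chosen only for illustration.
params = {
    "forget": (0.5, 0.5, 1.0),
    "input": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.5, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 2.0]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  c={c:+.4f}  h={h:+.4f}")
```

Because the hidden state is a sigmoid value times a tanh value, it always stays between -1 and 1, whatever the cell state holds.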
Why do we use LSTM with text data?
When performing normal text modelling, most of the preprocessing and modelling work focuses on arranging the data as a sequence. Examples of such tasks are POS tagging, stopword elimination, and sequencing of the text. These methods try to present the data to a model in a known pattern so that the model can understand it with less effort. On their own, they can already give reasonable results.
LSTM networks bring their own advantages here. As discussed earlier, an LSTM can memorize the order of the data. It also discards information that is not needed, and text data usually contains a lot of such information. Because the LSTM can learn to ignore it, the model can concentrate on the parts of the text that matter for the prediction.
In short, discarding unneeded information and memorizing the order of the information is what makes the LSTM a powerful tool for text classification and other text-based tasks.
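To make "sequencing of the text" concrete, here is a small sketch that turns sentences into lists of integers, which is the form an LSTM consumes. The function names, the whitespace tokenization and the reserved indices (0 for padding, 1 for unknown words) are assumptions made for this example. They are not the exact scheme used by the Keras IMDB data set.

```python
from collections import Counter


def build_vocab(texts, num_words):
    """Map the `num_words` most frequent words to integers, starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(num_words)]
    # Index 0 is reserved for padding and 1 for out-of-vocabulary words.
    return {word: index for index, word in enumerate(most_common, start=2)}


def encode(text, vocab, oov_index=1):
    """Replace each word by its index; unknown words get `oov_index`."""
    return [vocab.get(word, oov_index) for word in text.lower().split()]


texts = ["the film was good", "the film was bad", "the film is long", "the end"]
vocab = build_vocab(texts, num_words=3)
encoded = encode("the film was great", vocab)
print(vocab)
print(encoded)
```

Limiting the vocabulary to the most frequent words is the same idea as the `num_words` argument used when loading the IMDB data.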
Text classification using LSTM
In this section, I have created an LSTM model for text classification using the IMDB data set provided by Keras, which contains movie reviews written by users of the IMDB site.
You can use the full code below to build a model on a similar data set.
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Dense
from keras.preprocessing.sequence import pad_sequences

# fix random seed for reproducibility
np.random.seed(7)

# load the dataset but only keep the top 6000 words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=6000)

# pad input sequences so that every review has length 500
X_train = pad_sequences(X_train, maxlen=500)
X_test = pad_sequences(X_test, maxlen=500)

# model: embedding -> LSTM -> sigmoid output
model = Sequential()
model.add(Embedding(6000, 32, input_length=500))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# train the model for 3 epochs (batch size left at the Keras default)
model.fit(X_train, y_train, epochs=3)

# Final evaluation of the model; scores is [loss, accuracy]
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Before training the model, we padded the sequences so that every review enters the model with the same length of 500.
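What padding does to a single review can be sketched in plain Python. The helper below is a hypothetical stand-in that mimics the default behaviour of Keras `pad_sequences`: zeros are added at the front, and sequences that are too long keep only their last `maxlen` items.

```python
def pad_to_length(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(sequence) >= maxlen:
        return sequence[len(sequence) - maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence


short_review = [11, 42, 7]
long_review = [5, 9, 13, 21, 34, 55]

print(pad_to_length(short_review, maxlen=5))
print(pad_to_length(long_review, maxlen=4))
```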
For the modelling, we build a sequential model. The first layer is the embedding layer, which maps each word to a vector of length 32. The next layer is the LSTM layer, which has 100 units that work as the memory of the model. After the LSTM comes the dense layer, which is the output layer. Its sigmoid function produces a value between 0 and 1 that is used as the label.
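The size of this model can be checked by hand. The sketch below uses the usual parameter-count formulas for these layer types, assuming one bias vector per LSTM gate and the layer sizes from the code above (vocabulary 6000, embedding length 32, 100 LSTM units, 1 output). The function names are made up for this example.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per word in the vocabulary.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four weight sets (forget, input, candidate, output), each with
    # input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per input plus one bias, for every output unit.
    return units * (input_dim + 1)


total = embedding_params(6000, 32) + lstm_params(32, 100) + dense_params(100, 1)
print(embedding_params(6000, 32), lstm_params(32, 100), dense_params(100, 1), total)
```

Most of the parameters sit in the embedding layer, so the vocabulary size has the largest effect on the model size.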
In the data set, each review is either good or bad, and this is encoded as a label of 0 or 1. The loss function is binary cross-entropy, and Adam is a commonly suggested optimizer for text classification.
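Binary cross-entropy can be written out in a few lines. This is a plain-Python sketch of the standard formula, with a hypothetical `eps` clipping value to avoid taking the logarithm of zero. It is not the Keras implementation.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
confident_and_right = binary_cross_entropy(labels, [0.9, 0.1])
confident_and_wrong = binary_cross_entropy(labels, [0.1, 0.9])
print(confident_and_right, confident_and_wrong)
```

The loss is small when the sigmoid output is close to the true label and grows quickly when the model is confidently wrong.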
The image below shows the summary and results of the model we created.
We trained the model for only 3 epochs so that it does not overfit the relatively small data set. In the image, we can see that the results are very satisfactory: the accuracy rises to around 90% during training, and the final accuracy after all three epochs is 85%.
As we have seen in this article, we did no data preprocessing: we simply loaded the data and fed it into a simple LSTM model, and the model gave very satisfactory results. There are a number of changes we could make to the data or the model to further increase the accuracy. LSTM is a commonly used network for sequential data such as time series and audio. There are also various tasks in the time series analysis domain that we can perform using LSTM.