Now Reading
Implementing A Recurrent Neural Network (RNN) From Scratch

Implementing A Recurrent Neural Network (RNN) From Scratch

Recurrent Neural Network

Fully-connected neural networks and CNNs all learn a one-to-one mapping, for instance, mapping images to the number in the image or mapping given values of features to a prediction.  The gist is that the size of the input is fixed in all these “vanilla” neural networks. In this article, we’ll understand and build Recurrent Neural Network (RNNs), which learn functions that can be one-to-many, many-to-one, many-to-many. But what does that mean? They take as input sequences, such as speech, natural language, time series, or video. So when there’s a lot of information being conveyed sequentially, in the temporal change of the data, that’s where recurrent neural networks thrive. 

Formulating the Neural Network

Let’s take the example of a “many-to-many” RNN because that’s the problem type we’ll be working on. The inputs and outputs are denoted by x0, x1, … xn and y0, y1, … yn, respectively, where xi and yi are vectors with arbitrary dimensions. RNNs learn the temporal information with the help of a hidden state h, which is also a vector with arbitrary dimension.  

Register for our upcoming Masterclass>>
many-to-many recurrent neural network

For any given time step, the hidden state ht is calculated using the previous hidden state ht-1 and the current input ht:  

This ht vector is then used to calculate the output yt:

Here Wxh, Whh, and Why are the weight matrices for xt -> ht mappings,  ht-1 -> ht mappings and ht -> yt mappings, respectively. And bh and by are the bias vectors. 

Looking for a job change? Let us help you.

Gradient Problems in RNN

As powerful as recurrent neural networks are, they’re highly susceptible to gradient related problems in training. A network with n hidden layers will have n derivatives that will be multiplied together. If these derivatives are large, the gradient increases exponentially as it propagates backwards until it eventually explodes. This is called the problem of exploding gradient. Alternatively, if the derivatives are small, the gradient decreases exponentially as it is propagated until it eventually vanishes, and this is called the vanishing gradient problem.

We’ll use the following “tricks” to help minimize the effect of these issues:

  • Gradient Clipping- To avoid exploding gradients, simply limit the size of the gradients when the model is training.  The details of how gradient clipping works are beyond this article’s scope; you can read more about it here.
gradient clipping visualized
  • Weight Initialization – Initializing the weights to identity matrices and biases to zero help prevent the weights from shrinking to zero. You can learn more about this here.

Implementing a Recurrent Neural Network

We will be building a character level prediction RNN and train in on the text of “Harry Potter and the Philosopher’s Stone” because why not. Let’s start by initializing the model parameters, weights and biases.

 import numpy as np
 import matplotlib.pyplot as plt
 class ReccurentNN:
     def __init__(self, char_to_idx, idx_to_char, vocab, h_size=75,
                  seq_len=20, clip_value=5, epochs=50, learning_rate=1e-2):
         self.n_h = h_size 
         self.seq_len = seq_len 
         self.clip_value = clip_value  
         self.epochs = epochs  
         self.learning_rate = learning_rate
         self.char_to_idx = char_to_idx  
         self.idx_to_char = idx_to_char  
         self.vocab = vocab  
         # smoothing loss as batch SGD is noisy
         self.smooth_loss = -np.log(1.0 / self.vocab) * self.seq_len  
         # initialize parameters
         self.params = {}
         self.params["W_xh"] = np.random.randn(self.vocab, self.n_h) * 0.01 
         self.params["W_hh"] = np.identity(self.n_h) * 0.01
         self.params["b_h"] = np.zeros((1, self.n_h))
         self.params["W_hy"] = np.random.randn(self.n_h, self.vocab) * 0.01
         self.params["b_y"] = np.zeros((1, self.vocab))
         self.h0 = np.zeros((1, self.n_h))  # value of the hidden state at time step t = -1
         # initialize gradients and memory parameters for Adagrad
         self.grads = {}
         self.m_params = {}
         for key in self.params:
             self.grads["d" + key] = np.zeros_like(self.params[key])
             self.m_params["m" + key] = np.zeros_like(self.params[key]) 

The loss function of SGD is noisy so we’re smoothing it, and we’re using AdaGrad to adapt the learning rate based on the parameters and data observed in earlier iterations. 

Create the functions for encoding the text characters and creating batches

 def _encode_text(self, X):
         X_encoded = []
         for char in X:
             X_encoded.append(self.char_to_idx[char])
         return X_encoded

 def _prepare_batches(self, X, index):
         X_batch_encoded = X[index: index + self.seq_len]
         y_batch_encoded = X[index + 1: index + self.seq_len + 1]
         X_batch = []
         y_batch = []
         for i in X_batch_encoded:
             one_hot_char = np.zeros((1, self.vocab))
             one_hot_char[0][i] = 1
             X_batch.append(one_hot_char)
         for j in y_batch_encoded:
             one_hot_char = np.zeros((1, self.vocab))
             one_hot_char[0][j] = 1
             y_batch.append(one_hot_char)
         return X_batch, y_batch 

Create the softmax method that takes the final logits and gives probabilities

     def _softmax(self, x):
         e_x = np.exp(x - np.max(x)) 
         return e_x / np.sum(e_x) 

The maximum logit value is subtracted to improve numerical stability; you can learn more about this here. Implement the forward pass function that takes the current input sequence xt and previous hidden state ht to  calculates ht and yt 

     def _forward_pass(self, X):
         h = {}  # stores hidden states
         h[-1] = self.h0  # set initial hidden state at t=-1
         y_pred = {}  # stores softmax output probabilities
         # iterate over each character in the input sequence
         for t in range(self.seq_len):
             h[t] = np.tanh(
                 np.dot(X[t], self.params["W_xh"]) + np.dot(h[t - 1], self.params["W_hh"]) + self.params["b_h"])
             y_pred[t] = self._softmax(np.dot(h[t], self.params["W_hy"]) + self.params["b_y"])
         self.ho = h[t]
         return y_pred, h 

Create the function that backpropagates the error and calculate the gradients

     def _backward_pass(self, X, y, y_pred, h):
         dh_next = np.zeros_like(h[0])
         for t in reversed(range(self.seq_len)):
             dy = np.copy(y_pred[t])
             dy[0][np.argmax(y[t])] -= 1  # predicted y - actual y
             self.grads["dW_hy"] += np.dot(h[t].T, dy)
             self.grads["db_y"] += dy
             dhidden = (1 - h[t] ** 2) * (np.dot(dy, self.params["W_hy"].T) + dh_next)
             dh_next = np.dot(dhidden, self.params["W_hh"].T)
             self.grads["dW_hh"] += np.dot(h[t - 1].T, dhidden)
             self.grads["dW_xh"] += np.dot(X[t].T, dhidden)
             self.grads["db_h"] += dhidden
         for grad, key in enumerate(self.grads):
             np.clip(self.grads[key], -self.clip_value, self.clip_value, out=self.grads[key])
         return 

Function for update the parameters using AdaGrad 

     def _update(self):
         for key in self.params:
             self.m_params["m" + key] += self.grads["d" + key] * self.grads["d" + key]
             self.params[key] -= self.grads["d" + key] * self.learning_rate / (np.sqrt(self.m_params["m" + key]) + 1e-8) 

A test method that generates a sequence of characters of size test_size starting at a given index

     def test(self, test_size, start_index):
         res = ""
         x = np.zeros((1, self.vocab))
         x[0][start_index] = 1
         for i in range(test_size):
             # forward propagation
             h = np.tanh(np.dot(x, self.params["W_xh"]) + np.dot(self.h0, self.params["W_hh"]) + self.params["b_h"])
             y_pred = self._softmax(np.dot(h, self.params["W_hy"]) + self.params["b_y"])
             # get a random index from the probability distribution of y
             index = np.random.choice(range(self.vocab), p=y_pred.ravel())
             # set x-one_hot_vector for the next character
             x = np.zeros((1, self.vocab))
             x[0][index] = 1
             # find the char with the index and concat to the output string
             char = self.idx_to_char[index]
             res += char
         return res 

And finally, the training method that brings this all together.

     def train(self, X):
         loss = []
         # trim end of the text so we only get full sequences
         num_batches = len(X) // self.seq_len
         X_trimmed = X[:num_batches * self.seq_len] 
         # encode the characters to indices
         X_encoded = self._encode_text(X_trimmed) 
         for i in range(self.epochs):
             for j in range(0, len(X_encoded) - self.seq_len, self.seq_len):
                 X_batch, y_batch = self._prepare_batches(X_encoded, j)
                 y_pred, h = self._forward_pass(X_batch)
                 loss = 0
                 for t in range(self.seq_len):
                     loss += -np.log(y_pred[t][0, np.argmax(y_batch[t])])
                 self.smooth_loss = self.smooth_loss * 0.999 + loss * 0.001
                 loss.append(self.smooth_loss)
                 self._backward_pass(X_batch, y_batch, y_pred, h)
                 self._update()
             print(f'Epoch: {i + 1}\tLoss: {loss}')
             print(self.test(50,2))
         return loss, self.params 

Now let’s our recurrent neural network in action. 

 with open('Harry-Potter.txt') as f:
     text = f.read().lower()
 # use only a part of the text to make the process faster
 text = text[:20000] 
 chars = set(text)
 vocab = len(chars)
 # creating the encoding decoding dictionaries
 char_to_idx = {w: i for i, w in enumerate(chars)}
 idx_to_char = {i: w for i, w in enumerate(chars)}
 parameter_dict = {
         'char_to_idx': char_to_idx,
         'idx_to_char': idx_to_char,
         'vocab': vocab,
         'h_size': 75,
         'seq_len': 20,  # keep small to avoid diminishing/exploding gradients
         'clip_value': 5,
         'epochs': 50,
         'learning_rate': 1e-2,
     }
 model = ReccurentNN(**parameter_dict)
 loss, params = model.train(text)
 plt.figure(figsize=(12, 8))
 plt.plot([i for i in range(len(loss))], loss)
 plt.ylabel("Loss")
 plt.xlabel("Epochs")
 plt.show()
 print(model.test(50,10)) 
RNN from scratch loss curve
is othe on. ogofostheodindearidut wlethallle, st oserarey d -lers amoathe y thasathey at dll tos dn t s med d.). t t ile brs t d g htherive, d ogostare d. ay shag hythay boumay tey thas ot havininggon

Even with all our hacks and tricks the recurrent neural network still suffers from gradient problems and is only able to learn small sequences of characters. In the coming weeks, we’ll introduce more complex recurrent units with gates and try to improve the performance of our RNN. 

You can find the “Harry Potter and the Philosopher’s Stone” book text here. The above implementation has been made with a lot of help from this gist, and the code can be found in a Colab notebook here. Also, in hindsight, using the text of a book that contains spells and other non-English words might have made the task unnecessarily harder. 

References:
What Do You Think?

Join Our Discord Server. Be part of an engaging online community. Join Here.


Subscribe to our Newsletter

Get the latest updates and relevant offers by sharing your email.

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top