Fully-connected neural networks and CNNs learn a one-to-one mapping, for instance, mapping an image to the digit it contains or mapping given feature values to a prediction. The gist is that the size of the input is fixed in all these “vanilla” neural networks. In this article, we’ll understand and build Recurrent Neural Networks (RNNs), which learn functions that can be one-to-many, many-to-one or many-to-many. But what does that mean? RNNs take sequences as input, such as speech, natural language, time series or video. So when a lot of the information is conveyed sequentially, in the temporal change of the data, that’s where recurrent neural networks thrive.
Formulating the Neural Network
Let’s take the example of a “many-to-many” RNN because that’s the problem type we’ll be working on. The inputs and outputs are denoted by x0, x1, … xn and y0, y1, … yn, respectively, where xi and yi are vectors with arbitrary dimensions. RNNs learn the temporal information with the help of a hidden state h, which is also a vector with arbitrary dimension.
For any given time step t, the hidden state ht is calculated using the previous hidden state ht-1 and the current input xt:

ht = tanh(xt·Wxh + ht-1·Whh + bh)
This ht vector is then used to calculate the output yt:

yt = softmax(ht·Why + by)
Here Wxh, Whh, and Why are the weight matrices for xt -> ht mappings, ht-1 -> ht mappings and ht -> yt mappings, respectively. And bh and by are the bias vectors.
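To make the recurrence concrete, here is a minimal sketch of a single RNN step in NumPy; the sizes (a 4-character vocabulary and a 3-dimensional hidden state) are made up purely for illustration, and the full implementation later in the article uses the same two lines at its core.

import numpy as np

# illustrative sizes only: a 4-character vocabulary and a 3-dimensional hidden state
vocab, n_h = 4, 3
W_xh = np.random.randn(vocab, n_h) * 0.01   # xt -> ht
W_hh = np.identity(n_h) * 0.01              # ht-1 -> ht
W_hy = np.random.randn(n_h, vocab) * 0.01   # ht -> yt
b_h, b_y = np.zeros((1, n_h)), np.zeros((1, vocab))

x_t = np.eye(vocab)[[1]]       # a one-hot input vector of shape (1, vocab)
h_prev = np.zeros((1, n_h))    # previous hidden state ht-1

h_t = np.tanh(np.dot(x_t, W_xh) + np.dot(h_prev, W_hh) + b_h)   # new hidden state ht
logits = np.dot(h_t, W_hy) + b_y
y_t = np.exp(logits - np.max(logits))
y_t /= np.sum(y_t)             # softmax output yt: probabilities over the vocabulary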
Gradient Problems in RNN
As powerful as recurrent neural networks are, they’re highly susceptible to gradient-related problems during training. An RNN unrolled over n time steps behaves like a network with n layers, so backpropagation multiplies n derivative terms together. If these derivatives are large, the gradient grows exponentially as it propagates backwards until it eventually explodes. This is called the problem of exploding gradients. Alternatively, if the derivatives are small, the gradient shrinks exponentially as it is propagated back until it eventually vanishes, and this is called the vanishing gradient problem.
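A toy calculation makes the effect easy to see; the factors 0.5 and 1.5 and the unroll length of 50 are arbitrary values chosen only for illustration:

import numpy as np

timesteps = 50                              # arbitrary unroll length for the demo
print(np.prod(np.full(timesteps, 0.5)))     # ~8.9e-16: the product vanishes
print(np.prod(np.full(timesteps, 1.5)))     # ~6.4e+08: the product explodes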
We’ll use the following “tricks” to help minimize the effect of these issues:
- Gradient Clipping – To avoid exploding gradients, we simply limit the size of the gradients while the model is training. The details of how gradient clipping works are beyond this article’s scope; you can read more about it here.
- Weight Initialization – Initializing the recurrent weights to identity matrices and the biases to zero helps keep the gradients from shrinking to zero. You can learn more about this here; a minimal sketch of both tricks follows this list.
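Here is a minimal sketch of how both tricks will appear in the implementation; n_h and clip_value are placeholder values:

import numpy as np

n_h, clip_value = 75, 5   # placeholder hidden size and clipping threshold

# identity initialization for the recurrent weights, zeros for the bias
# (scaled by 0.01, mirroring the implementation below)
W_hh = np.identity(n_h) * 0.01
b_h = np.zeros((1, n_h))

# gradient clipping: cap every gradient entry to the range [-clip_value, clip_value]
dW_hh = np.random.randn(n_h, n_h) * 100   # stand-in for an exploding gradient
np.clip(dW_hh, -clip_value, clip_value, out=dW_hh)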
Implementing a Recurrent Neural Network
We will be building a character-level prediction RNN and train it on the text of “Harry Potter and the Philosopher’s Stone”, because why not. Let’s start by initializing the model parameters: the weights and biases.
import numpy as np
import matplotlib.pyplot as plt


class ReccurentNN:
    def __init__(self, char_to_idx, idx_to_char, vocab, h_size=75,
                 seq_len=20, clip_value=5, epochs=50, learning_rate=1e-2):
        self.n_h = h_size
        self.seq_len = seq_len
        self.clip_value = clip_value
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.char_to_idx = char_to_idx
        self.idx_to_char = idx_to_char
        self.vocab = vocab

        # smoothing loss as batch SGD is noisy
        self.smooth_loss = -np.log(1.0 / self.vocab) * self.seq_len

        # initialize parameters
        self.params = {}
        self.params["W_xh"] = np.random.randn(self.vocab, self.n_h) * 0.01
        self.params["W_hh"] = np.identity(self.n_h) * 0.01
        self.params["b_h"] = np.zeros((1, self.n_h))
        self.params["W_hy"] = np.random.randn(self.n_h, self.vocab) * 0.01
        self.params["b_y"] = np.zeros((1, self.vocab))

        self.h0 = np.zeros((1, self.n_h))  # value of the hidden state at time step t = -1

        # initialize gradients and memory parameters for Adagrad
        self.grads = {}
        self.m_params = {}
        for key in self.params:
            self.grads["d" + key] = np.zeros_like(self.params[key])
            self.m_params["m" + key] = np.zeros_like(self.params[key])
The loss from mini-batch SGD is noisy, so we keep an exponentially smoothed version of it, and we use AdaGrad to adapt the learning rate of each parameter based on the gradients observed in earlier iterations.
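For reference, here is the AdaGrad rule in isolation, applied to a single dummy parameter; theta, m and the gradient are placeholders, and the 1e-8 term avoids division by zero, matching the _update method further below:

import numpy as np

theta = np.zeros((3, 3))        # a dummy parameter matrix
m = np.zeros_like(theta)        # running sum of squared gradients
learning_rate = 1e-2

grad = np.random.randn(3, 3)    # stand-in gradient for one update step
m += grad * grad                # accumulate squared gradients
theta -= learning_rate * grad / (np.sqrt(m) + 1e-8)   # per-parameter scaled update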
Create the functions for encoding the text characters and creating batches
    def _encode_text(self, X):
        X_encoded = []
        for char in X:
            X_encoded.append(self.char_to_idx[char])
        return X_encoded

    def _prepare_batches(self, X, index):
        # targets are the inputs shifted one character ahead
        X_batch_encoded = X[index: index + self.seq_len]
        y_batch_encoded = X[index + 1: index + self.seq_len + 1]

        X_batch = []
        y_batch = []
        for i in X_batch_encoded:
            one_hot_char = np.zeros((1, self.vocab))
            one_hot_char[0][i] = 1
            X_batch.append(one_hot_char)
        for j in y_batch_encoded:
            one_hot_char = np.zeros((1, self.vocab))
            one_hot_char[0][j] = 1
            y_batch.append(one_hot_char)
        return X_batch, y_batch
Create the softmax method that takes the final logits and gives probabilities
    def _softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / np.sum(e_x)
The maximum logit value is subtracted to improve numerical stability; you can learn more about this here. Next, implement the forward pass function that takes the current input sequence and the previous hidden state and calculates ht and yt for every time step
    def _forward_pass(self, X):
        h = {}  # stores hidden states
        h[-1] = self.h0  # set initial hidden state at t=-1
        y_pred = {}  # stores softmax output probabilities

        # iterate over each character in the input sequence
        for t in range(self.seq_len):
            h[t] = np.tanh(
                np.dot(X[t], self.params["W_xh"])
                + np.dot(h[t - 1], self.params["W_hh"])
                + self.params["b_h"])
            y_pred[t] = self._softmax(np.dot(h[t], self.params["W_hy"]) + self.params["b_y"])

        self.h0 = h[t]  # carry the last hidden state over to the next batch
        return y_pred, h
Create the function that backpropagates the error and calculates the gradients
    def _backward_pass(self, X, y, y_pred, h):
        # reset the gradients accumulated for the previous batch
        for key in self.grads:
            self.grads[key] = np.zeros_like(self.grads[key])

        dh_next = np.zeros_like(h[0])
        for t in reversed(range(self.seq_len)):
            dy = np.copy(y_pred[t])
            dy[0][np.argmax(y[t])] -= 1  # predicted y - actual y

            self.grads["dW_hy"] += np.dot(h[t].T, dy)
            self.grads["db_y"] += dy

            dhidden = (1 - h[t] ** 2) * (np.dot(dy, self.params["W_hy"].T) + dh_next)
            dh_next = np.dot(dhidden, self.params["W_hh"].T)

            self.grads["dW_hh"] += np.dot(h[t - 1].T, dhidden)
            self.grads["dW_xh"] += np.dot(X[t].T, dhidden)
            self.grads["db_h"] += dhidden

        # clip the gradients to avoid them exploding
        for key in self.grads:
            np.clip(self.grads[key], -self.clip_value, self.clip_value, out=self.grads[key])
Function for updating the parameters using AdaGrad
    def _update(self):
        for key in self.params:
            # accumulate the squared gradients and scale the update per parameter
            self.m_params["m" + key] += self.grads["d" + key] * self.grads["d" + key]
            self.params[key] -= self.learning_rate * self.grads["d" + key] / (
                np.sqrt(self.m_params["m" + key]) + 1e-8)
A test method that generates a sequence of characters of size test_size starting at a given index
    def test(self, test_size, start_index):
        res = ""
        x = np.zeros((1, self.vocab))
        x[0][start_index] = 1
        h = self.h0  # start from the last hidden state seen during training
        for i in range(test_size):
            # forward propagation, carrying the hidden state across steps
            h = np.tanh(np.dot(x, self.params["W_xh"]) + np.dot(h, self.params["W_hh"]) + self.params["b_h"])
            y_pred = self._softmax(np.dot(h, self.params["W_hy"]) + self.params["b_y"])

            # sample an index from the probability distribution of y
            index = np.random.choice(range(self.vocab), p=y_pred.ravel())

            # set x to the one-hot vector of the sampled character for the next step
            x = np.zeros((1, self.vocab))
            x[0][index] = 1

            # find the char with the index and concat to the output string
            char = self.idx_to_char[index]
            res += char
        return res
And finally, the training method that brings this all together.
    def train(self, X):
        loss_history = []

        # trim end of the text so we only get full sequences
        num_batches = len(X) // self.seq_len
        X_trimmed = X[:num_batches * self.seq_len]

        # encode the characters to indices
        X_encoded = self._encode_text(X_trimmed)

        for i in range(self.epochs):
            for j in range(0, len(X_encoded) - self.seq_len, self.seq_len):
                X_batch, y_batch = self._prepare_batches(X_encoded, j)
                y_pred, h = self._forward_pass(X_batch)

                # cross-entropy loss over the batch
                batch_loss = 0
                for t in range(self.seq_len):
                    batch_loss += -np.log(y_pred[t][0, np.argmax(y_batch[t])])
                self.smooth_loss = self.smooth_loss * 0.999 + batch_loss * 0.001
                loss_history.append(self.smooth_loss)

                self._backward_pass(X_batch, y_batch, y_pred, h)
                self._update()

            print(f'Epoch: {i + 1}\tLoss: {self.smooth_loss}')
            print(self.test(50, 2))
        return loss_history, self.params
Now let’s see our recurrent neural network in action.
with open('Harry-Potter.txt') as f:
    text = f.read().lower()

# use only a part of the text to make the process faster
text = text[:20000]

chars = set(text)
vocab = len(chars)

# create the encoding/decoding dictionaries
char_to_idx = {w: i for i, w in enumerate(chars)}
idx_to_char = {i: w for i, w in enumerate(chars)}

parameter_dict = {
    'char_to_idx': char_to_idx,
    'idx_to_char': idx_to_char,
    'vocab': vocab,
    'h_size': 75,
    'seq_len': 20,  # keep small to avoid diminishing/exploding gradients
    'clip_value': 5,
    'epochs': 50,
    'learning_rate': 1e-2,
}

model = ReccurentNN(**parameter_dict)
loss, params = model.train(text)

plt.figure(figsize=(12, 8))
plt.plot(loss)
plt.ylabel("Loss")
plt.xlabel("Training iterations")
plt.show()

print(model.test(50, 10))
is othe on. ogofostheodindearidut wlethallle, st oserarey d -lers amoathe y thasathey at dll tos dn t s med d.). t t ile brs t d g htherive, d ogostare d. ay shag hythay boumay tey thas ot havininggon
Even with all our hacks and tricks, the recurrent neural network still suffers from gradient problems and is only able to learn short sequences of characters. In the coming weeks, we’ll introduce more complex recurrent units with gates and try to improve the performance of our RNN.
You can find the “Harry Potter and the Philosopher’s Stone” book text here. The above implementation has been made with a lot of help from this gist, and the code can be found in a Colab notebook here. Also, in hindsight, using the text of a book that contains spells and other non-English words might have made the task unnecessarily harder.