Fully-connected neural networks and CNNs learn a one-to-one mapping, for instance, mapping an image to the digit it contains or mapping given feature values to a prediction. The gist is that the size of the input is fixed in all these “vanilla” neural networks. In this article, we’ll understand and build Recurrent Neural Networks (RNNs), which learn functions that can be one-to-many, many-to-one or many-to-many. But what does that mean? RNNs take sequences as input, such as speech, natural language, time series or video. So when a lot of the information is conveyed sequentially, in the temporal change of the data, that’s where recurrent neural networks thrive.
Formulating the Neural Network
Let’s take the example of a “many-to-many” RNN because that’s the problem type we’ll be working on. The inputs and outputs are denoted by x0, x1, … xn and y0, y1, … yn, respectively, where xi and yi are vectors with arbitrary dimensions. RNNs learn the temporal information with the help of a hidden state h, which is also a vector with arbitrary dimension.
For any given time step t, the hidden state ht is calculated using the previous hidden state ht-1 and the current input xt:

ht = tanh(xt·Wxh + ht-1·Whh + bh)
This ht vector is then used to calculate the output yt:

yt = softmax(ht·Why + by)
Here Wxh, Whh, and Why are the weight matrices for xt -> ht mappings, ht-1 -> ht mappings and ht -> yt mappings, respectively. And bh and by are the bias vectors.
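To make the recurrence concrete, here is a minimal sketch of a single RNN step in NumPy; the sizes (a 4-character vocabulary and a 3-dimensional hidden state) are made up purely for illustration, and the full implementation later in the article uses the same two lines at its core.

import numpy as np

# illustrative sizes only: a 4-character vocabulary and a 3-dimensional hidden state
vocab, n_h = 4, 3
W_xh = np.random.randn(vocab, n_h) * 0.01   # xt -> ht
W_hh = np.identity(n_h) * 0.01              # ht-1 -> ht
W_hy = np.random.randn(n_h, vocab) * 0.01   # ht -> yt
b_h, b_y = np.zeros((1, n_h)), np.zeros((1, vocab))

x_t = np.eye(vocab)[[1]]       # a one-hot input vector of shape (1, vocab)
h_prev = np.zeros((1, n_h))    # previous hidden state ht-1

h_t = np.tanh(np.dot(x_t, W_xh) + np.dot(h_prev, W_hh) + b_h)   # new hidden state ht
logits = np.dot(h_t, W_hy) + b_y
y_t = np.exp(logits - np.max(logits))
y_t /= np.sum(y_t)             # softmax output yt: probabilities over the vocabulary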
Gradient Problems in RNN
As powerful as recurrent neural networks are, they’re highly susceptible to gradient-related problems during training. An RNN unrolled over n time steps behaves like a network with n layers, so backpropagation multiplies n derivative terms together. If these derivatives are large, the gradient grows exponentially as it propagates backwards until it eventually explodes. This is called the problem of exploding gradients. Alternatively, if the derivatives are small, the gradient shrinks exponentially as it is propagated back until it eventually vanishes, and this is called the vanishing gradient problem.
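A toy calculation makes the effect easy to see; the factors 0.5 and 1.5 and the unroll length of 50 are arbitrary values chosen only for illustration:

import numpy as np

timesteps = 50                              # arbitrary unroll length for the demo
print(np.prod(np.full(timesteps, 0.5)))     # ~8.9e-16: the product vanishes
print(np.prod(np.full(timesteps, 1.5)))     # ~6.4e+08: the product explodes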
We’ll use the following “tricks” to help minimize the effect of these issues:
- Gradient Clipping – To avoid exploding gradients, we simply limit the size of the gradients while the model is training. The details of how gradient clipping works are beyond this article’s scope; you can read more about it here.
- Weight Initialization – Initializing the recurrent weights to identity matrices and the biases to zero helps keep the gradients from shrinking to zero. You can learn more about this here; a minimal sketch of both tricks follows this list.
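Here is a minimal sketch of how both tricks will appear in the implementation; n_h and clip_value are placeholder values:

import numpy as np

n_h, clip_value = 75, 5   # placeholder hidden size and clipping threshold

# identity initialization for the recurrent weights, zeros for the bias
# (scaled by 0.01, mirroring the implementation below)
W_hh = np.identity(n_h) * 0.01
b_h = np.zeros((1, n_h))

# gradient clipping: cap every gradient entry to the range [-clip_value, clip_value]
dW_hh = np.random.randn(n_h, n_h) * 100   # stand-in for an exploding gradient
np.clip(dW_hh, -clip_value, clip_value, out=dW_hh)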
Implementing a Recurrent Neural Network
We will be building a character-level prediction RNN and train it on the text of “Harry Potter and the Philosopher’s Stone”, because why not. Let’s start by initializing the model parameters: the weights and biases.
import numpy as np
import matplotlib.pyplot as plt


class ReccurentNN:
    def __init__(self, char_to_idx, idx_to_char, vocab, h_size=75,
                 seq_len=20, clip_value=5, epochs=50, learning_rate=1e-2):
        self.n_h = h_size
        self.seq_len = seq_len
        self.clip_value = clip_value
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.char_to_idx = char_to_idx
        self.idx_to_char = idx_to_char
        self.vocab = vocab

        # smoothing loss as batch SGD is noisy
        self.smooth_loss = -np.log(1.0 / self.vocab) * self.seq_len

        # initialize parameters
        self.params = {}
        self.params["W_xh"] = np.random.randn(self.vocab, self.n_h) * 0.01
        self.params["W_hh"] = np.identity(self.n_h) * 0.01
        self.params["b_h"] = np.zeros((1, self.n_h))
        self.params["W_hy"] = np.random.randn(self.n_h, self.vocab) * 0.01
        self.params["b_y"] = np.zeros((1, self.vocab))

        self.h0 = np.zeros((1, self.n_h))  # value of the hidden state at time step t = -1

        # initialize gradients and memory parameters for Adagrad
        self.grads = {}
        self.m_params = {}
        for key in self.params:
            self.grads["d" + key] = np.zeros_like(self.params[key])
            self.m_params["m" + key] = np.zeros_like(self.params[key])
The loss from mini-batch SGD is noisy, so we keep an exponentially smoothed version of it, and we use AdaGrad to adapt the learning rate of each parameter based on the gradients observed in earlier iterations.
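For reference, here is the AdaGrad rule in isolation, applied to a single dummy parameter; theta, m and the gradient are placeholders, and the 1e-8 term avoids division by zero, matching the _update method further below:

import numpy as np

theta = np.zeros((3, 3))        # a dummy parameter matrix
m = np.zeros_like(theta)        # running sum of squared gradients
learning_rate = 1e-2

grad = np.random.randn(3, 3)    # stand-in gradient for one update step
m += grad * grad                # accumulate squared gradients
theta -= learning_rate * grad / (np.sqrt(m) + 1e-8)   # per-parameter scaled update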
Create the functions for encoding the text characters and creating batches
    def _encode_text(self, X):
        X_encoded = []
        for char in X:
            X_encoded.append(self.char_to_idx[char])
        return X_encoded

    def _prepare_batches(self, X, index):
        # targets are the inputs shifted one character ahead
        X_batch_encoded = X[index: index + self.seq_len]
        y_batch_encoded = X[index + 1: index + self.seq_len + 1]

        X_batch = []
        y_batch = []
        for i in X_batch_encoded:
            one_hot_char = np.zeros((1, self.vocab))
            one_hot_char[0][i] = 1
            X_batch.append(one_hot_char)
        for j in y_batch_encoded:
            one_hot_char = np.zeros((1, self.vocab))
            one_hot_char[0][j] = 1
            y_batch.append(one_hot_char)
        return X_batch, y_batch
Create the softmax method that takes the final logits and gives probabilities
    def _softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / np.sum(e_x)
The maximum logit value is subtracted to improve numerical stability; you can learn more about this here. Next, implement the forward pass function that takes the current input sequence and the previous hidden state and calculates ht and yt for every time step
    def _forward_pass(self, X):
        h = {}  # stores hidden states
        h[-1] = self.h0  # set initial hidden state at t=-1
        y_pred = {}  # stores softmax output probabilities

        # iterate over each character in the input sequence
        for t in range(self.seq_len):
            h[t] = np.tanh(
                np.dot(X[t], self.params["W_xh"])
                + np.dot(h[t - 1], self.params["W_hh"])
                + self.params["b_h"])
            y_pred[t] = self._softmax(np.dot(h[t], self.params["W_hy"]) + self.params["b_y"])

        self.h0 = h[t]  # carry the last hidden state over to the next batch
        return y_pred, h
Create the function that backpropagates the error and calculates the gradients
    def _backward_pass(self, X, y, y_pred, h):
        # reset the gradients accumulated for the previous batch
        for key in self.grads:
            self.grads[key] = np.zeros_like(self.grads[key])

        dh_next = np.zeros_like(h[0])
        for t in reversed(range(self.seq_len)):
            dy = np.copy(y_pred[t])
            dy[0][np.argmax(y[t])] -= 1  # predicted y - actual y

            self.grads["dW_hy"] += np.dot(h[t].T, dy)
            self.grads["db_y"] += dy

            dhidden = (1 - h[t] ** 2) * (np.dot(dy, self.params["W_hy"].T) + dh_next)
            dh_next = np.dot(dhidden, self.params["W_hh"].T)

            self.grads["dW_hh"] += np.dot(h[t - 1].T, dhidden)
            self.grads["dW_xh"] += np.dot(X[t].T, dhidden)
            self.grads["db_h"] += dhidden

        # clip the gradients to avoid them exploding
        for key in self.grads:
            np.clip(self.grads[key], -self.clip_value, self.clip_value, out=self.grads[key])
Function for updating the parameters using AdaGrad
    def _update(self):
        for key in self.params:
            # accumulate the squared gradients and scale the update per parameter
            self.m_params["m" + key] += self.grads["d" + key] * self.grads["d" + key]
            self.params[key] -= self.learning_rate * self.grads["d" + key] / (
                np.sqrt(self.m_params["m" + key]) + 1e-8)
A test method that generates a sequence of characters of size test_size starting at a given index
    def test(self, test_size, start_index):
        res = ""
        x = np.zeros((1, self.vocab))
        x[0][start_index] = 1
        h = self.h0  # start from the last hidden state seen during training
        for i in range(test_size):
            # forward propagation, carrying the hidden state across steps
            h = np.tanh(np.dot(x, self.params["W_xh"]) + np.dot(h, self.params["W_hh"]) + self.params["b_h"])
            y_pred = self._softmax(np.dot(h, self.params["W_hy"]) + self.params["b_y"])

            # sample an index from the probability distribution of y
            index = np.random.choice(range(self.vocab), p=y_pred.ravel())

            # set x to the one-hot vector of the sampled character for the next step
            x = np.zeros((1, self.vocab))
            x[0][index] = 1

            # find the char with the index and concat to the output string
            char = self.idx_to_char[index]
            res += char
        return res
And finally, the training method that brings this all together.
    def train(self, X):
        loss_history = []

        # trim end of the text so we only get full sequences
        num_batches = len(X) // self.seq_len
        X_trimmed = X[:num_batches * self.seq_len]

        # encode the characters to indices
        X_encoded = self._encode_text(X_trimmed)

        for i in range(self.epochs):
            for j in range(0, len(X_encoded) - self.seq_len, self.seq_len):
                X_batch, y_batch = self._prepare_batches(X_encoded, j)
                y_pred, h = self._forward_pass(X_batch)

                # cross-entropy loss over the batch
                batch_loss = 0
                for t in range(self.seq_len):
                    batch_loss += -np.log(y_pred[t][0, np.argmax(y_batch[t])])
                self.smooth_loss = self.smooth_loss * 0.999 + batch_loss * 0.001
                loss_history.append(self.smooth_loss)

                self._backward_pass(X_batch, y_batch, y_pred, h)
                self._update()

            print(f'Epoch: {i + 1}\tLoss: {self.smooth_loss}')
            print(self.test(50, 2))
        return loss_history, self.params
Now let’s see our recurrent neural network in action.
with open('Harry-Potter.txt') as f:
    text = f.read().lower()

# use only a part of the text to make the process faster
text = text[:20000]

chars = set(text)
vocab = len(chars)

# create the encoding/decoding dictionaries
char_to_idx = {w: i for i, w in enumerate(chars)}
idx_to_char = {i: w for i, w in enumerate(chars)}

parameter_dict = {
    'char_to_idx': char_to_idx,
    'idx_to_char': idx_to_char,
    'vocab': vocab,
    'h_size': 75,
    'seq_len': 20,  # keep small to avoid diminishing/exploding gradients
    'clip_value': 5,
    'epochs': 50,
    'learning_rate': 1e-2,
}

model = ReccurentNN(**parameter_dict)
loss, params = model.train(text)

plt.figure(figsize=(12, 8))
plt.plot(loss)
plt.ylabel("Loss")
plt.xlabel("Training iterations")
plt.show()

print(model.test(50, 10))
is othe on. ogofostheodindearidut wlethallle, st oserarey d -lers amoathe y thasathey at dll tos dn t s med d.). t t ile brs t d g htherive, d ogostare d. ay shag hythay boumay tey thas ot havininggon
Even with all our hacks and tricks, the recurrent neural network still suffers from gradient problems and is only able to learn short sequences of characters. In the coming weeks, we’ll introduce more complex recurrent units with gates and try to improve the performance of our RNN.
You can find the “Harry Potter and the Philosopher’s Stone” book text here. The above implementation has been made with a lot of help from this gist, and the code can be found in a Colab notebook here. Also, in hindsight, using the text of a book that contains spells and other non-English words might have made the task unnecessarily harder.