Fully-connected neural networks and CNNs all learn a one-to-one mapping, for instance, mapping images to the number in the image or mapping given values of features to a prediction. The gist is that the size of the input is fixed in all these “vanilla” neural networks. In this article, we’ll understand and build Recurrent Neural Network (RNNs), which learn functions that can be one-to-many, many-to-one, many-to-many. But what does that mean? They take as input sequences, such as speech, natural language, time series, or video. So when there’s a lot of information being conveyed sequentially, in the temporal change of the data, that’s where recurrent neural networks thrive.

### Formulating the Neural Network

Let’s take the example of a “many-to-many” RNN because that’s the problem type we’ll be working on. The inputs and outputs are denoted by x_{0}, x_{1}, … x_{n} and y_{0}, y_{1}, … y_{n,} respectively, where x_{i} and y_{i} are vectors with arbitrary dimensions. RNNs learn the temporal information with the help of a hidden state h, which is also a vector with arbitrary dimension.

For any given time step, the hidden state h_{t} is calculated using the previous hidden state h_{t-1} and the current input h_{t}:

#### Subscribe to our Newsletter

##### Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy

##### Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

This h_{t} vector is then used to calculate the output y_{t}:

Here W_{xh}, W_{hh}, and W_{hy} are the weight matrices for x_{t} -> h_{t} mappings, h_{t-1} -> h_{t} mappings and h_{t} -> y_{t} mappings, respectively. And b_{h} and b_{y} are the bias vectors.

### Gradient Problems in RNN

As powerful as recurrent neural networks are, they’re highly susceptible to gradient related problems in training. A network with n hidden layers will have n derivatives that will be multiplied together. If these derivatives are large, the gradient increases exponentially as it propagates backwards until it eventually explodes. This is called the problem of exploding gradient. Alternatively, if the derivatives are small, the gradient decreases exponentially as it is propagated until it eventually vanishes, and this is called the vanishing gradient problem.

We’ll use the following “tricks” to help minimize the effect of these issues:

- Gradient Clipping- To avoid exploding gradients, simply limit the size of the gradients when the model is training. The details of how gradient clipping works are beyond this article’s scope; you can read more about it here.

- Weight Initialization – Initializing the weights to identity matrices and biases to zero help prevent the weights from shrinking to zero. You can learn more about this here.

### Implementing a Recurrent Neural Network

We will be building a character level prediction RNN and train in on the text of “Harry Potter and the Philosopher’s Stone” because why not. Let’s start by initializing the model parameters, weights and biases.

import numpy as np import matplotlib.pyplot as plt class ReccurentNN: def __init__(self, char_to_idx, idx_to_char, vocab, h_size=75, seq_len=20, clip_value=5, epochs=50, learning_rate=1e-2): self.n_h = h_size self.seq_len = seq_len self.clip_value = clip_value self.epochs = epochs self.learning_rate = learning_rate self.char_to_idx = char_to_idx self.idx_to_char = idx_to_char self.vocab = vocab # smoothing loss as batch SGD is noisy self.smooth_loss = -np.log(1.0 / self.vocab) * self.seq_len # initialize parameters self.params = {} self.params["W_xh"] = np.random.randn(self.vocab, self.n_h) * 0.01 self.params["W_hh"] = np.identity(self.n_h) * 0.01 self.params["b_h"] = np.zeros((1, self.n_h)) self.params["W_hy"] = np.random.randn(self.n_h, self.vocab) * 0.01 self.params["b_y"] = np.zeros((1, self.vocab)) self.h0 = np.zeros((1, self.n_h)) # value of the hidden state at time step t = -1 # initialize gradients and memory parameters for Adagrad self.grads = {} self.m_params = {} for key in self.params: self.grads["d" + key] = np.zeros_like(self.params[key]) self.m_params["m" + key] = np.zeros_like(self.params[key])

The loss function of SGD is noisy so we’re smoothing it, and we’re using AdaGrad to adapt the learning rate based on the parameters and data observed in earlier iterations.

Create the functions for encoding the text characters and creating batches

def _encode_text(self, X): X_encoded = [] for char in X: X_encoded.append(self.char_to_idx[char]) return X_encoded def _prepare_batches(self, X, index): X_batch_encoded = X[index: index + self.seq_len] y_batch_encoded = X[index + 1: index + self.seq_len + 1] X_batch = [] y_batch = [] for i in X_batch_encoded: one_hot_char = np.zeros((1, self.vocab)) one_hot_char[0][i] = 1 X_batch.append(one_hot_char) for j in y_batch_encoded: one_hot_char = np.zeros((1, self.vocab)) one_hot_char[0][j] = 1 y_batch.append(one_hot_char) return X_batch, y_batch

Create the softmax method that takes the final logits and gives probabilities

def _softmax(self, x): e_x = np.exp(x - np.max(x)) return e_x / np.sum(e_x)

The maximum logit value is subtracted to improve numerical stability; you can learn more about this here. Implement the forward pass function that takes the current input sequence x_{t} and previous hidden state h_{t} to calculates h_{t} and y_{t}

def _forward_pass(self, X): h = {} # stores hidden states h[-1] = self.h0 # set initial hidden state at t=-1 y_pred = {} # stores softmax output probabilities # iterate over each character in the input sequence for t in range(self.seq_len): h[t] = np.tanh( np.dot(X[t], self.params["W_xh"]) + np.dot(h[t - 1], self.params["W_hh"]) + self.params["b_h"]) y_pred[t] = self._softmax(np.dot(h[t], self.params["W_hy"]) + self.params["b_y"]) self.ho = h[t] return y_pred, h

Create the function that backpropagates the error and calculate the gradients

def _backward_pass(self, X, y, y_pred, h): dh_next = np.zeros_like(h[0]) for t in reversed(range(self.seq_len)): dy = np.copy(y_pred[t]) dy[0][np.argmax(y[t])] -= 1 # predicted y - actual y self.grads["dW_hy"] += np.dot(h[t].T, dy) self.grads["db_y"] += dy dhidden = (1 - h[t] ** 2) * (np.dot(dy, self.params["W_hy"].T) + dh_next) dh_next = np.dot(dhidden, self.params["W_hh"].T) self.grads["dW_hh"] += np.dot(h[t - 1].T, dhidden) self.grads["dW_xh"] += np.dot(X[t].T, dhidden) self.grads["db_h"] += dhidden for grad, key in enumerate(self.grads): np.clip(self.grads[key], -self.clip_value, self.clip_value, out=self.grads[key]) return

Function for update the parameters using AdaGrad

def _update(self): for key in self.params: self.m_params["m" + key] += self.grads["d" + key] * self.grads["d" + key] self.params[key] -= self.grads["d" + key] * self.learning_rate / (np.sqrt(self.m_params["m" + key]) + 1e-8)

A test method that generates a sequence of characters of size *test_size* starting at a given index

def test(self, test_size, start_index): res = "" x = np.zeros((1, self.vocab)) x[0][start_index] = 1 for i in range(test_size): # forward propagation h = np.tanh(np.dot(x, self.params["W_xh"]) + np.dot(self.h0, self.params["W_hh"]) + self.params["b_h"]) y_pred = self._softmax(np.dot(h, self.params["W_hy"]) + self.params["b_y"]) # get a random index from the probability distribution of y index = np.random.choice(range(self.vocab), p=y_pred.ravel()) # set x-one_hot_vector for the next character x = np.zeros((1, self.vocab)) x[0][index] = 1 # find the char with the index and concat to the output string char = self.idx_to_char[index] res += char return res

And finally, the training method that brings this all together.

def train(self, X): loss = [] # trim end of the text so we only get full sequences num_batches = len(X) // self.seq_len X_trimmed = X[:num_batches * self.seq_len] # encode the characters to indices X_encoded = self._encode_text(X_trimmed) for i in range(self.epochs): for j in range(0, len(X_encoded) - self.seq_len, self.seq_len): X_batch, y_batch = self._prepare_batches(X_encoded, j) y_pred, h = self._forward_pass(X_batch) loss = 0 for t in range(self.seq_len): loss += -np.log(y_pred[t][0, np.argmax(y_batch[t])]) self.smooth_loss = self.smooth_loss * 0.999 + loss * 0.001 loss.append(self.smooth_loss) self._backward_pass(X_batch, y_batch, y_pred, h) self._update() print(f'Epoch: {i + 1}\tLoss: {loss}') print(self.test(50,2)) return loss, self.params

Now let’s our recurrent neural network in action.

with open('Harry-Potter.txt') as f: text = f.read().lower() # use only a part of the text to make the process faster text = text[:20000] chars = set(text) vocab = len(chars) # creating the encoding decoding dictionaries char_to_idx = {w: i for i, w in enumerate(chars)} idx_to_char = {i: w for i, w in enumerate(chars)} parameter_dict = { 'char_to_idx': char_to_idx, 'idx_to_char': idx_to_char, 'vocab': vocab, 'h_size': 75, 'seq_len': 20, # keep small to avoid diminishing/exploding gradients 'clip_value': 5, 'epochs': 50, 'learning_rate': 1e-2, } model = ReccurentNN(**parameter_dict) loss, params = model.train(text) plt.figure(figsize=(12, 8)) plt.plot([i for i in range(len(loss))], loss) plt.ylabel("Loss") plt.xlabel("Epochs") plt.show() print(model.test(50,10))

is othe on. ogofostheodindearidut wlethallle, st oserarey d -lers amoathe y thasathey at dll tos dn t s med d.). t t ile brs t d g htherive, d ogostare d. ay shag hythay boumay tey thas ot havininggon

Even with all our hacks and tricks the recurrent neural network still suffers from gradient problems and is only able to learn small sequences of characters. In the coming weeks, we’ll introduce more complex recurrent units with gates and try to improve the performance of our RNN.

You can find the “Harry Potter and the Philosopher’s Stone” book text here. The above implementation has been made with a lot of help from this gist, and the code can be found in a Colab notebook here. Also, in hindsight, using the text of a book that contains spells and other non-English words might have made the task unnecessarily harder.