# Implementing A Recurrent Neural Network (RNN) From Scratch

In this article, we implement a character-level recurrent neural network (RNN) from scratch in Python using NumPy.

Fully-connected neural networks and CNNs learn a one-to-one mapping, for instance, mapping an image to the number it contains or mapping given feature values to a prediction. The gist is that the size of the input is fixed in all these “vanilla” neural networks. In this article, we’ll understand and build recurrent neural networks (RNNs), which learn functions that can be one-to-many, many-to-one, or many-to-many. What does that mean? They take sequences as input, such as speech, natural language, time series, or video. When a lot of the information is conveyed sequentially, in the temporal change of the data, that’s where recurrent neural networks thrive.

### Formulating the Neural Network

Let’s take the example of a “many-to-many” RNN because that’s the problem type we’ll be working on. The inputs and outputs are denoted by x0, x1, … xn and y0, y1, … yn, respectively, where xi and yi are vectors with arbitrary dimensions. RNNs learn the temporal information with the help of a hidden state h, which is also a vector with arbitrary dimension.

For any given time step t, the hidden state ht is calculated from the previous hidden state ht-1 and the current input xt:

ht = tanh(Wxh · xt + Whh · ht-1 + bh)

This ht vector is then used to calculate the output yt:

yt = softmax(Why · ht + by)

Here Wxh, Whh, and Why are the weight matrices for the xt -> ht, ht-1 -> ht, and ht -> yt mappings, respectively, and bh and by are the bias vectors.
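
To make the recurrence concrete, here is a minimal sketch of a single time step in NumPy; the toy dimensions and variable names are made up purely for illustration and are not part of the model we build below.

```
import numpy as np

vocab, n_h = 4, 3                             # toy sizes: 4-character vocabulary, hidden state of size 3
W_xh = np.random.randn(vocab, n_h) * 0.01     # xt -> ht weights
W_hh = np.identity(n_h)                       # ht-1 -> ht weights
W_hy = np.random.randn(n_h, vocab) * 0.01     # ht -> yt weights
b_h, b_y = np.zeros((1, n_h)), np.zeros((1, vocab))

x_t = np.zeros((1, vocab)); x_t[0, 2] = 1     # one-hot encoded input character
h_prev = np.zeros((1, n_h))                   # hidden state from the previous time step

h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
logits = h_t @ W_hy + b_y
y_t = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
print(h_t.shape, y_t.shape)                   # (1, 3) (1, 4)
```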

As powerful as recurrent neural networks are, they’re highly susceptible to gradient-related problems during training. An RNN unrolled over n time steps behaves like a network with n layers, so n per-step derivatives get multiplied together during backpropagation. If these derivatives are large, the gradient grows exponentially as it propagates backwards until it eventually explodes; this is the exploding gradient problem. If the derivatives are small, the gradient shrinks exponentially as it is propagated back until it eventually vanishes; this is the vanishing gradient problem.
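
A back-of-the-envelope illustration of why this happens: over 50 time steps, a per-step derivative of 0.5 all but wipes the gradient out, while 1.5 per step blows it up. The numbers here are made up; only the exponential behaviour matters.

```
steps = 50
print(0.5 ** steps)   # ~8.9e-16: the gradient effectively vanishes
print(1.5 ** steps)   # ~6.4e+08: the gradient explodes
```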

We’ll use the following “tricks” to help minimize the effect of these issues:

• Weight Initialization – Initializing the recurrent weights to an identity matrix and the biases to zero helps prevent the hidden state, and with it the gradients, from shrinking to zero (see the sketch after this list). You can learn more about this here.

• Gradient Clipping – Clipping each gradient to a fixed range (the clip_value parameter in the code below) keeps a single large gradient from blowing up an update.

• Short Sequences – Training on short sequences (a seq_len of 20 characters) limits how many per-step derivatives get multiplied together in the first place.
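
Here is a rough sketch of why the identity initialization helps: if we ignore the input term and the tanh nonlinearity, repeatedly multiplying the hidden state by an identity matrix leaves its norm untouched, while multiplying by a small random matrix collapses it towards zero. This is an illustration under those simplifying assumptions, not part of the model code.

```
import numpy as np

n_h, steps = 75, 50
h_identity = np.ones((1, n_h))
h_random = np.ones((1, n_h))
W_identity = np.identity(n_h)
W_random = np.random.randn(n_h, n_h) * 0.01
for _ in range(steps):
    h_identity = h_identity @ W_identity
    h_random = h_random @ W_random
print(np.linalg.norm(h_identity))  # unchanged after 50 steps
print(np.linalg.norm(h_random))    # shrinks to (numerically) zero
```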

### Implementing a Recurrent Neural Network

We will be building a character-level prediction RNN and train it on the text of “Harry Potter and the Philosopher’s Stone” because why not. Let’s start by initializing the model parameters: the weights and biases.

```
import numpy as np
import matplotlib.pyplot as plt


class ReccurentNN:
    def __init__(self, char_to_idx, idx_to_char, vocab, h_size=75,
                 seq_len=20, clip_value=5, epochs=50, learning_rate=1e-2):
        self.n_h = h_size
        self.seq_len = seq_len
        self.clip_value = clip_value
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.char_to_idx = char_to_idx
        self.idx_to_char = idx_to_char
        self.vocab = vocab
        # smoothing loss as batch SGD is noisy
        self.smooth_loss = -np.log(1.0 / self.vocab) * self.seq_len
        # initialize parameters
        self.params = {}
        self.params["W_xh"] = np.random.randn(self.vocab, self.n_h) * 0.01
        self.params["W_hh"] = np.identity(self.n_h) * 0.01
        self.params["b_h"] = np.zeros((1, self.n_h))
        self.params["W_hy"] = np.random.randn(self.n_h, self.vocab) * 0.01
        self.params["b_y"] = np.zeros((1, self.vocab))
        self.h0 = np.zeros((1, self.n_h))  # value of the hidden state at time step t = -1
        # AdaGrad accumulators, one per parameter
        self.m_params = {}
        for key in self.params:
            self.m_params["m" + key] = np.zeros_like(self.params[key])
```

The per-batch SGD loss is noisy, so we keep an exponentially smoothed version of it for reporting, and we use AdaGrad to adapt each parameter’s learning rate based on the gradients observed in earlier iterations.
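
For reference, these two techniques boil down to the following sketch; the function names here are illustrative, and the actual updates happen inside the class methods below.

```
import numpy as np

def smooth(smooth_loss, batch_loss):
    # exponential moving average of the noisy per-batch loss, used only for reporting
    return 0.999 * smooth_loss + 0.001 * batch_loss

def adagrad_step(param, grad, m, learning_rate=1e-2):
    # AdaGrad: accumulate squared gradients and scale each parameter's step size
    m += grad ** 2
    param -= learning_rate * grad / (np.sqrt(m) + 1e-8)
    return param, m
```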

Create the methods for encoding the text characters as indices and for preparing the input/target batches:

```
    def _encode_text(self, X):
        X_encoded = []
        for char in X:
            X_encoded.append(self.char_to_idx[char])
        return X_encoded

    def _prepare_batches(self, X, index):
        X_batch_encoded = X[index: index + self.seq_len]
        y_batch_encoded = X[index + 1: index + self.seq_len + 1]
        X_batch = []
        y_batch = []
        for i in X_batch_encoded:
            one_hot_char = np.zeros((1, self.vocab))
            one_hot_char[0][i] = 1
            X_batch.append(one_hot_char)
        for j in y_batch_encoded:
            one_hot_char = np.zeros((1, self.vocab))
            one_hot_char[0][j] = 1
            y_batch.append(one_hot_char)
        return X_batch, y_batch
```
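
As a quick sanity check (assuming the methods above have been added to the class), the targets produced by _prepare_batches are simply the inputs shifted one character to the right; toy_text and toy_model below are made-up names for illustration.

```
toy_text = "harry potter"
toy_chars = sorted(set(toy_text))
toy_model = ReccurentNN(char_to_idx={c: i for i, c in enumerate(toy_chars)},
                        idx_to_char={i: c for i, c in enumerate(toy_chars)},
                        vocab=len(toy_chars), seq_len=5)
encoded = toy_model._encode_text(toy_text)
X_batch, y_batch = toy_model._prepare_batches(encoded, 0)
print(len(X_batch), X_batch[0].shape)       # 5 (1, 9): five one-hot encoded characters
print(toy_text[0:5], "->", toy_text[1:6])   # "harry" -> "arry "
```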

Create the softmax method that takes the final logits and turns them into probabilities:

```
    def _softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / np.sum(e_x)
```

The maximum logit value is subtracted to improve numerical stability: np.exp overflows for large logits, and subtracting the maximum leaves the resulting probabilities unchanged. You can learn more about this here.

Next, implement the forward pass method, which steps through the input sequence, carrying the hidden state forward from one time step to the next, and calculates ht and yt at each step:

```
    def _forward_pass(self, X):
        h = {}  # stores hidden states
        h[-1] = self.h0  # set initial hidden state at t = -1
        y_pred = {}  # stores softmax output probabilities
        # iterate over each character in the input sequence
        for t in range(self.seq_len):
            h[t] = np.tanh(
                np.dot(X[t], self.params["W_xh"]) + np.dot(h[t - 1], self.params["W_hh"]) + self.params["b_h"])
            y_pred[t] = self._softmax(np.dot(h[t], self.params["W_hy"]) + self.params["b_y"])
        self.h0 = h[self.seq_len - 1]  # carry the last hidden state over to the next batch
        return y_pred, h
```

Create the method that backpropagates the error through time and calculates the gradients, clipping them to clip_value so they cannot explode:

```
    def _backward_pass(self, X, y, y_pred, h):
        # initialize the gradients to zero
        self.grads = {"d" + key: np.zeros_like(self.params[key]) for key in self.params}
        dh_next = np.zeros_like(h[0])
        for t in reversed(range(self.seq_len)):
            dy = np.copy(y_pred[t])
            dy[0][np.argmax(y[t])] -= 1  # predicted y - actual y
            self.grads["dW_hy"] += np.dot(h[t].T, dy)
            self.grads["db_y"] += dy
            dhidden = (1 - h[t] ** 2) * (np.dot(dy, self.params["W_hy"].T) + dh_next)
            dh_next = np.dot(dhidden, self.params["W_hh"].T)
            self.grads["dW_xh"] += np.dot(X[t].T, dhidden)
            self.grads["dW_hh"] += np.dot(h[t - 1].T, dhidden)
            self.grads["db_h"] += dhidden
        # clip the gradients to mitigate exploding gradients
        for key in self.grads:
            np.clip(self.grads[key], -self.clip_value, self.clip_value, out=self.grads[key])
        return
```
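
The update method then applies the AdaGrad rule described earlier: the squared gradients are accumulated for each parameter and used to scale its effective learning rate.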

```
    def _update(self):
        for key in self.params:
            # AdaGrad: accumulate the squared gradients for each parameter
            self.m_params["m" + key] += self.grads["d" + key] ** 2
            self.params[key] -= self.grads["d" + key] * self.learning_rate / (np.sqrt(self.m_params["m" + key]) + 1e-8)
```

A test method generates a sequence of test_size characters, starting from the character at a given index:

```
    def test(self, test_size, start_index):
        res = ""
        x = np.zeros((1, self.vocab))
        x[0][start_index] = 1
        h = self.h0  # start sampling from the current hidden state
        for i in range(test_size):
            # forward propagation for a single time step
            h = np.tanh(np.dot(x, self.params["W_xh"]) + np.dot(h, self.params["W_hh"]) + self.params["b_h"])
            y_pred = self._softmax(np.dot(h, self.params["W_hy"]) + self.params["b_y"])
            # sample an index from the probability distribution of y
            index = np.random.choice(range(self.vocab), p=y_pred.ravel())
            # set x to the one-hot vector of the sampled character for the next step
            x = np.zeros((1, self.vocab))
            x[0][index] = 1
            # find the char with the index and concat to the output string
            char = self.idx_to_char[index]
            res += char
        return res
```

And finally, the training method that brings this all together.

```
    def train(self, X):
        loss_history = []
        # trim end of the text so we only get full sequences
        num_batches = len(X) // self.seq_len
        X_trimmed = X[:num_batches * self.seq_len]
        # encode the characters to indices
        X_encoded = self._encode_text(X_trimmed)
        for i in range(self.epochs):
            for j in range(0, len(X_encoded) - self.seq_len, self.seq_len):
                X_batch, y_batch = self._prepare_batches(X_encoded, j)
                y_pred, h = self._forward_pass(X_batch)
                # cross-entropy loss for this batch
                loss = 0
                for t in range(self.seq_len):
                    loss += -np.log(y_pred[t][0, np.argmax(y_batch[t])])
                self.smooth_loss = self.smooth_loss * 0.999 + loss * 0.001
                loss_history.append(self.smooth_loss)
                self._backward_pass(X_batch, y_batch, y_pred, h)
                self._update()
            print(f'Epoch: {i + 1}\tLoss: {self.smooth_loss}')
            print(self.test(50, 2))
        return loss_history, self.params
```

Now let’s see our recurrent neural network in action.

```
with open('Harry-Potter.txt') as f:
    text = f.read()
# use only a part of the text to make the process faster
text = text[:20000]
chars = set(text)
vocab = len(chars)
# create the encoding/decoding dictionaries
char_to_idx = {w: i for i, w in enumerate(chars)}
idx_to_char = {i: w for i, w in enumerate(chars)}

parameter_dict = {
    'char_to_idx': char_to_idx,
    'idx_to_char': idx_to_char,
    'vocab': vocab,
    'h_size': 75,
    'seq_len': 20,  # keep small to avoid vanishing/exploding gradients
    'clip_value': 5,
    'epochs': 50,
    'learning_rate': 1e-2,
}
model = ReccurentNN(**parameter_dict)
loss, params = model.train(text)

plt.figure(figsize=(12, 8))
plt.plot([i for i in range(len(loss))], loss)
plt.ylabel("Loss")
plt.xlabel("Iterations")
plt.show()

print(model.test(50, 10))
```
`is othe on. ogofostheodindearidut wlethallle, st oserarey d -lers amoathe y thasathey at dll tos dn t s med d.). t t ile brs t d g htherive, d ogostare d. ay shag hythay boumay tey thas ot havininggon`

Even with all our hacks and tricks, the recurrent neural network still suffers from gradient problems and is only able to learn short sequences of characters. In the coming weeks, we’ll introduce more complex recurrent units with gates and try to improve the performance of our RNN.

You can find the “Harry Potter and the Philosopher’s Stone” book text here. The above implementation has been made with a lot of help from this gist, and the code can be found in a Colab notebook here. Also, in hindsight, using the text of a book that contains spells and other non-English words might have made the task unnecessarily harder.
