Name Language Prediction using Recurrent Neural Network in PyTorch

In this article, we will demonstrate the implementation of a Recurrent Neural Network (RNN) using PyTorch for the task of multi-class text classification. The RNN model will be trained on person names belonging to 18 language classes. After successful training, the model will predict the language category to which a given name most likely belongs.
Recurrent Neural Network in PyTorch

Recurrent Neural Networks have been applied very successfully as deep learning models in tasks that deal with sequential data, especially in Natural Language Processing. Traditional feed-forward networks take a fixed-size input at once and produce a fixed-size output, whereas recurrent networks process the input one element at a time, carrying a hidden state from step to step. This property makes them outperform feed-forward models on many NLP tasks, which is why RNN models are popularly applied to text classification problems.
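
To make the recurrence concrete, here is a minimal sketch (illustrative only, not part of the article's model) of a single-layer recurrent update: at every step, the current input is combined with the previous hidden state, so earlier elements of the sequence can influence later outputs. All sizes and weights here are arbitrary.

import torch

W = torch.randn(20, 30)           # weights over the concatenated [x_t; h_{t-1}] (toy sizes)
h = torch.zeros(20)               # initial hidden state h_0
for x in torch.randn(5, 10):      # a toy sequence of 5 input vectors
    h = torch.tanh(W @ torch.cat((x, h)))   # h_t = tanh(W [x_t; h_{t-1}])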


Implementation of RNN in PyTorch

This implementation was done in Google Colab, and the data set was read from Google Drive. The lines of code below mount Google Drive in the Colab notebook and list the text files in the data set.

from google.colab import drive
drive.mount('/content/gdrive')



from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

# Return the list of file paths matching a glob pattern
# (Colab displays the returned list when this call is the last expression in a cell)
def printFiles(path):
    return glob.glob(path)

printFiles('gdrive/My Drive/Dataset/data/data/names/*.txt')

The lines of code below define a function that converts Unicode text to its plain ASCII equivalent.

import unicodedata
import string

all_let = string.ascii_letters + " .,;'"
n_let = len(all_let)

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_let
    )
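
A quick check of this function (the example name is ours, for illustration): accented characters are reduced to their base ASCII letters.

print(unicodeToAscii('Ślusàrski'))   # expected output: Slusarski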

The code snippet below reads every data file and builds a dictionary mapping each language category to its list of names.

cat_line = {}
all_cats = []

# Read a file and split into lines
def readLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in printFiles('gdrive/My Drive/Dataset/data/data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_cats.append(category)
    lines = readLines(filename)
    cat_line[category] = lines

n_categories = len(all_cats)

We will check the above result on the first 4 Japanese names.

#Check names in a category
print(cat_line['Japanese'][:4])


In the next step, functions will be defined to turn the names into one-hot tensors, making them compatible with the RNN model.

import torch
# Find letter index from all_let, e.g. "a" = 0
def letterToIndex(letter):
    return all_let.find(letter)

# Turn a letter into a <1 x n_let> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_let)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_let>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_let)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

We will check the above functions by converting a letter to a tensor and a name to a tensor.

print(letterToTensor('K'))
print(lineToTensor('Kakinomoto').size())

In the next step, we will define the Recurrent Neural Network model.

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
#Binding model
rnn = RNN(n_let, n_hidden, n_categories)
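
As a quick sanity check (our addition, not in the original article), we can count the trainable parameters of this small model:

n_params = sum(p.numel() for p in rnn.parameters())
print(n_params)   # (57 + 128 + 1) * 128 for i2h plus (57 + 128 + 1) * 18 for i2o = 27,156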

We will check this model by generating the output tensor for a name.

input = lineToTensor('Aalsburg')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)


Since the model's final layer is LogSoftmax, this untrained model has produced a log-likelihood for each of the 18 categories to which the given input name might belong.
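
As a small sketch (our addition), these log-likelihoods can be exponentiated to view them as probabilities that sum to 1 across the categories:

probs = torch.exp(output)
print(probs.sum())   # expected to be approximately 1.0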

Now, we will define functions that provide random training examples to the network during training and map network outputs back to their categories.

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_cats)
    line = randomChoice(cat_line[category])
    category_tensor = torch.tensor([all_cats.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

#Check on a random sample
for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_cats[category_i], category_i
#Check category for an output
print(categoryFromOutput(output))



In the next step, the loss criterion, the learning rate, and the training function will be defined, and the RNN model will be trained for 100,000 iterations.

criterion = nn.NLLLoss()   # negative log-likelihood loss, which pairs with the LogSoftmax output
learning_rate = 0.005

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Add parameters' gradients to their values, multiplied by learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()
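
The update step above is plain stochastic gradient descent written by hand. An equivalent sketch using torch.optim (not used in the article, which keeps the manual update) would look like this:

import torch.optim as optim

optimizer = optim.SGD(rnn.parameters(), lr=learning_rate)

def train_with_optimizer(category_tensor, line_tensor):
    hidden = rnn.initHidden()
    optimizer.zero_grad()
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    loss = criterion(output, category_tensor)
    loss.backward()
    optimizer.step()   # performs the same SGD update as the manual parameter loop
    return output, loss.item()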

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0


After training, we will visualize the loss to see the performance.

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)
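
Optionally (our addition), the axes can be labelled; since the loss is averaged every plot_every = 1000 iterations, the x-axis is in thousands of iterations:

plt.xlabel('Iterations (in thousands)')
plt.ylabel('Average loss')
plt.show()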

The code snippet below evaluates the model on 10,000 randomly sampled names and plots the resulting confusion matrix.

# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_cats.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
figsize = (10, 10)
fig = plt.figure(figsize=figsize)
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_cats, rotation=90)
ax.set_yticklabels([''] + all_cats)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()
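
Bright cells off the main diagonal indicate language pairs the model commonly confuses. As a quick numeric summary (our addition, not in the original article), the diagonal of the row-normalized matrix holds the per-class accuracy, and its mean gives a macro-averaged accuracy:

per_class_acc = confusion.diag()   # per-class accuracy, since each row sums to 1
print('macro-averaged accuracy: %.3f' % per_class_acc.mean().item())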

The function below prints, for a given name, the top likelihoods of belonging to the language categories.

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_cats[category_index]))
            predictions.append([value, all_cats[category_index]])

Finally, we will check the predicted likelihoods for the given three names.

predict('Aggelen')
predict('Accardo')
predict('Ferreiro')


So, as we can see from the printed output, the RNN model has given, for each of the given names, the likelihoods of the language categories they belong to. For example, for the name ‘Aggelen’, it has given the top 3 likelihoods, in which ‘French’ has the highest value; that is, according to the trained RNN model, the name ‘Aggelen’ has the highest chance of belonging to the ‘French’ language category. All three predictions are correct. We could apply the argmax to print only the category with the highest likelihood (a small sketch follows below), but the top 3 predictions are shown to make the result clearer. You can check this model on more predictions and tune the hyperparameters to improve the accuracy.
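
As mentioned above, printing only the top category amounts to taking the argmax over the output; a minimal sketch (the helper name predictTop1 is ours):

def predictTop1(input_line):
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))
    return all_cats[output.argmax(dim=1).item()]

print(predictTop1('Aggelen'))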

References:

  1. Gabriel Loye, ‘A Beginner’s Guide on Recurrent Neural Networks with PyTorch’.
  2. ‘NLP from Scratch: Classifying Names with a Character-Level RNN’, PyTorch Tutorials.
