Hey folks! In this tutorial, we will build an RNN and an LSTM model to predict a person's nationality from the characters of their name.
Let’s begin by understanding the dataset we have.
Understanding the Dataset
The dataset is a text file in which each row contains a person's name and the nationality of that name, separated by a comma. It contains more than 20k names and 18 unique nationalities, such as Portuguese, Irish, and Spanish.
A snapshot of the data is shown below. You can download the dataset here.
Predict Nationality Using People’s Names In Python
Let’s get right into the code implementation. We’ll begin by importing the modules, and then the names and the nationalities dataset that we’ve chosen for this demonstration.
Step 1: Importing Modules
Before we start building any model, we need to import all the required libraries into our program.
from io import open
import os, string, random, time, math
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from IPython.display import clear_output
Step 2: Loading the Dataset
To load the dataset, we go through each row in the data and create a list of tuples containing names and nationalities together. This will make it easier for the model to understand the data in the later sections.
languages = []  # unique nationalities, in order of first appearance
data = []       # (name, nationality) tuples
X = []          # names
y = []          # nationalities

with open("name2lang.txt", 'r') as f:  # read the dataset
    for line in f:
        line = line.split(",")
        name = line[0].strip()
        lang = line[1].strip()
        if lang not in languages:
            languages.append(lang)
        X.append(name)
        y.append(lang)
        data.append((name, lang))

n_languages = len(languages)
print("Number of Names: ", len(X))
print("Number of Languages: ", n_languages)
print("All Names: ", X)
print("All languages: ", languages)
print("Final Data: ", data)
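To see the parsing step in isolation, here is a minimal sketch that runs the same split-and-strip logic on a few rows held in memory instead of the name2lang.txt file. The sample rows and the parse_rows helper are made up for illustration and are not part of the dataset or the code above.

```python
import io

# Hypothetical sample rows in the "name, nationality" format described above.
sample = io.StringIO(
    "Silva, Portuguese\nMurphy, Irish\nGarcia, Spanish\nCosta, Portuguese\n"
)

def parse_rows(lines):
    """Return (pairs, languages) from an iterable of 'name, nationality' lines."""
    pairs, languages = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, lang = (part.strip() for part in line.split(",", 1))
        if lang not in languages:
            languages.append(lang)  # keep first-seen order, as in the loader
        pairs.append((name, lang))
    return pairs, languages

pairs, languages = parse_rows(sample)
print(pairs)
print(languages)
```

Each nationality is stored once, so its position in the list can later serve as its class index.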
Step 3: Train-test Split
We will split the data into training and testing sets in the ratio 80:20, where 80% of the data goes to training and the remaining 20% goes to testing. Passing stratify = y keeps the share of each nationality roughly the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123, stratify = y)
print("Training Data: ", len(X_train))
print("Testing Data: ", len(X_test))
Training Data: 16040
Testing Data: 4010
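To make the idea of a stratified split concrete, here is a rough pure-Python sketch. It is a hand-rolled stand-in, not scikit-learn's algorithm, so the exact rows it picks will differ from train_test_split. The stratified_split helper and the toy data are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(X, y, test_size=0.2, seed=123):
    """Split so each label keeps roughly the same share in train and test."""
    by_label = defaultdict(list)
    for name, label in zip(X, y):
        by_label[label].append(name)
    rng = random.Random(seed)
    train, test = [], []
    for label, names in by_label.items():
        rng.shuffle(names)
        n_test = round(len(names) * test_size)  # per-label test count
        test += [(n, label) for n in names[:n_test]]
        train += [(n, label) for n in names[n_test:]]
    return train, test

# Hypothetical toy data: 10 names with one label, 5 with another.
X = [f"name{i}" for i in range(15)]
y = ["Irish"] * 10 + ["Spanish"] * 5
train, test = stratified_split(X, y)
print(Counter(label for _, label in train))
print(Counter(label for _, label in test))
```

Both labels keep their 2:1 ratio in the training and testing parts, which is what stratification is meant to preserve.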
Step 4: Encoding Data
The sequence model takes character encodings as input rather than the raw text. As a result, we must encode each name at the character level.
Once we have an encoding for each character, we stack the character-level encodings in order to get the encoding for the entire name. This process is carried out for all names, and the nationalities are encoded separately.
all_letters = string.ascii_letters + ".,;"
print(string.ascii_letters)
n_letters = len(all_letters)

def name_rep(name):
    # one row per character, shape (len(name), 1, n_letters)
    rep = torch.zeros(len(name), 1, n_letters)
    for index, letter in enumerate(name):
        pos = all_letters.find(letter)
        rep[index][0][pos] = 1
    return rep
The function name_rep above generates a one-hot encoding for a name. To begin, we declare a tensor of zeros whose first dimension equals the length of the name and whose last dimension equals the total number of characters in all_letters.
Following that, we loop over each character, find the index of that letter in all_letters, and set that position to 1, leaving the remaining values at 0.
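The same idea can be checked without PyTorch. The sketch below uses plain Python lists as a stand-in for the tensor, and the one_hot_name helper is a hypothetical name. One deliberate difference: it raises an error for characters outside all_letters, whereas str.find returns -1 for them, which in the tensor version would silently mark the last column.

```python
import string

all_letters = string.ascii_letters + ".,;"  # same alphabet as in the tutorial
n_letters = len(all_letters)

def one_hot_name(name):
    """One row per character; each row has a single 1 at the character's index."""
    rows = []
    for letter in name:
        pos = all_letters.find(letter)
        if pos == -1:
            raise ValueError(f"character {letter!r} is not in all_letters")
        row = [0] * n_letters
        row[pos] = 1
        rows.append(row)
    return rows

rep = one_hot_name("Ab")
print(len(rep), len(rep[0]))             # number of rows, alphabet size
print(rep[0].index(1), rep[1].index(1))  # positions of 'A' and 'b'
```

Because string.ascii_letters lists the lowercase letters first, 'b' lands at index 1 and 'A' at index 26.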
def nat_rep(lang):
    return torch.tensor([languages.index(lang)], dtype = torch.long)
Encoding nationalities follows a much simpler logic than encoding names. We look up the index of the nationality in our list of nationalities and use that index as its encoding.
Step 5: Building the Neural Network Model
We will build an RNN model using PyTorch by creating a class for it.
The __init__ function (the constructor) initializes the network's components: the hidden size and the linear layers that hold the weights and biases.
class RNN_net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN_net, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)  # input + hidden -> hidden
        self.i2o = nn.Linear(input_size + hidden_size, output_size)  # input + hidden -> output
        self.softmax = nn.LogSoftmax(dim = 1)

    def forward(self, input_, hidden):
        combined = torch.cat((input_, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)
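The forward function first concatenates a character's input and the hidden representation. It then uses the combined vector as input to the i2h layer, which produces the next hidden representation, and to the i2o layer followed by log-softmax, which produces a score for each nationality.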
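As a quick shape check on a single step of this computation, here is a minimal sketch with freshly initialized layers. It assumes PyTorch is installed and uses the sizes from this tutorial (55 characters, 128 hidden units, 18 nationalities). The weights are random, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_letters, n_hidden, n_languages = 55, 128, 18  # sizes assumed from the tutorial

i2h = nn.Linear(n_letters + n_hidden, n_hidden)
i2o = nn.Linear(n_letters + n_hidden, n_languages)
log_softmax = nn.LogSoftmax(dim=1)

x = torch.zeros(1, n_letters)
x[0][26] = 1                        # one-hot row for a single character
hidden = torch.zeros(1, n_hidden)   # initial hidden state

combined = torch.cat((x, hidden), 1)  # shape (1, n_letters + n_hidden)
new_hidden = i2h(combined)            # shape (1, n_hidden)
output = log_softmax(i2o(combined))   # shape (1, n_languages)

print(combined.shape, new_hidden.shape, output.shape)
print(output.exp().sum().item())      # probabilities sum to approximately 1
```

Exponentiating the log-softmax output gives a probability distribution over the nationalities, which is why the values sum to about 1.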
def infer(net, name):
    net.eval()
    name_ohe = name_rep(name)
    hidden = net.init_hidden()
    # feed the name one character at a time, carrying the hidden state forward
    for i in range(name_ohe.size()[0]):
        output, hidden = net(name_ohe[i], hidden)
    return output

n_hidden = 128
net = RNN_net(n_letters, n_hidden, n_languages)
output = infer(net, "Adam")
index = torch.argmax(output)
print(output, index)
The network instance and the person's name are passed as input arguments to the infer function. In this function, we set the network to evaluation mode and compute the one-hot representation of the input name.
Following that, we initialize the hidden representation according to the hidden size and loop over all of the characters, passing each character and the current hidden representation to the network and keeping the updated hidden representation for the next step.
Finally, we return the output from the last character, which holds a log-probability for each nationality. The index of the largest value is the predicted nationality.
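To illustrate how an output tensor maps back to nationality names, here is a small sketch assuming PyTorch is installed. The three-entry label list and the log-probabilities are made up for illustration and do not come from a trained network.

```python
import torch

languages = ["Portuguese", "Irish", "Spanish"]  # hypothetical short label list
# Made-up log-probabilities standing in for the network output, shape (1, 3).
output = torch.log(torch.tensor([[0.1, 0.7, 0.2]]))

best = torch.argmax(output).item()   # index of the most likely class
values, indices = output.topk(2)     # two highest scores and their indices

print(languages[best])
print([languages[i] for i in indices[0].tolist()])
```

Here the top-1 prediction is "Irish", and the top-2 predictions are "Irish" followed by "Spanish".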
Step 6: Computing Accuracy of the RNN Model
Before moving on to training the model, let’s create a function to compute the accuracy of the model.
To do that, we will create an evaluation function that takes the following as input:
- Network instance
- The number of data points
- The value of k
- X and Y testing data
def dataloader(npoints, X_, y_):
    # sample npoints (name, nationality) pairs at random, with replacement
    to_ret = []
    for i in range(npoints):
        index_ = np.random.randint(len(X_))
        name, lang = X_[index_], y_[index_]
        to_ret.append((name, lang, name_rep(name), nat_rep(lang)))
    return to_ret

def eval(net, n_points, k, X_, y_):  # note: this name shadows Python's built-in eval
    data_ = dataloader(n_points, X_, y_)
    correct = 0
    for name, language, name_ohe, lang_rep in data_:
        output = infer(net, name)
        val, indices = output.topk(k)  # k highest-scoring classes
        if lang_rep in indices:
            correct += 1
    accuracy = correct/n_points
    return accuracy
Inside the function we will be performing the following operations:
- Load the data using the dataloader function.
- Iterate over all person names returned by the data loader.
- Invoke the model on the inputs and get the outputs.
- Compute the top-k predicted classes.
- Count the total number of correctly predicted classes.
- Return the final accuracy as a fraction of the number of data points.
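The steps above boil down to a top-k accuracy computation, which can be sketched in plain Python. The top_k_accuracy helper and the scores below are made up for illustration. The tutorial's own eval function works on tensors and uses topk instead of sorting.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of rows whose true label index is among the k highest scores."""
    correct = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)

# Hypothetical scores for 4 names over 3 classes, with their true class indices.
scores = [
    [0.70, 0.20, 0.10],
    [0.10, 0.50, 0.40],
    [0.35, 0.25, 0.40],
    [0.20, 0.50, 0.30],
]
labels = [0, 2, 0, 0]

print(top_k_accuracy(scores, labels, 1))
print(top_k_accuracy(scores, labels, 2))
```

Top-2 accuracy can never be lower than top-1 accuracy, since every top-1 hit also counts as a top-2 hit.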
Step 7: Training the RNN Model
In order to train the model, we will be coding a simple function to train our network.
def train(net, opt, criterion, n_points):
    opt.zero_grad()
    total_loss = 0
    data_ = dataloader(n_points, X_train, y_train)
    for name, language, name_ohe, lang_rep in data_:
        hidden = net.init_hidden()
        # feed the name one character at a time
        for i in range(name_ohe.size()[0]):
            output, hidden = net(name_ohe[i], hidden)
        # loss on the output after the last character; gradients accumulate over the batch
        loss = criterion(output, lang_rep)
        loss.backward(retain_graph=True)
        total_loss += loss.item()  # store a plain float, not a tensor
    opt.step()
    return total_loss/n_points

def train_setup(net, lr = 0.01, n_batches = 100, batch_size = 10, momentum = 0.9, display_freq = 5):
    criterion = nn.NLLLoss()
    opt = optim.SGD(net.parameters(), lr = lr, momentum = momentum)
    loss_arr = np.zeros(n_batches + 1)
    for i in range(n_batches):
        # running mean of the per-batch loss
        loss_arr[i + 1] = (loss_arr[i]*i + train(net, opt, criterion, batch_size))/(i + 1)
        if i%display_freq == display_freq - 1:
            clear_output(wait = True)
            print("Iteration number ", i + 1,
                  "Top-1 Accuracy:", round(eval(net, len(X_test), 1, X_test, y_test), 4),
                  'Top-2 Accuracy:', round(eval(net, len(X_test), 2, X_test, y_test), 4),
                  'Loss:', round(loss_arr[i + 1], 4))
            plt.figure()
            plt.plot(loss_arr[1:i + 2], "-*")
            plt.xlabel("Iteration")
            plt.ylabel("Loss")
            plt.show()
            print("\n\n")

n_hidden = 128
net = RNN_net(n_letters, n_hidden, n_languages)
train_setup(net, lr = 0.0005, n_batches = 100, batch_size = 256)
After training the model for 100 batches, we are able to achieve a top-1 accuracy of 66.5% and a top-2 accuracy of 79% with the RNN Model.
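The loss curve plotted by train_setup is a running mean of the per-batch loss rather than the raw loss of each batch. Here is a small sketch of that update rule on its own. The running_mean helper and the loss values are made up for illustration.

```python
def running_mean(values):
    """Apply the update mean[i + 1] = (mean[i] * i + new) / (i + 1)."""
    means = [0.0]  # placeholder for index 0, like loss_arr[0]
    for i, new in enumerate(values):
        means.append((means[i] * i + new) / (i + 1))
    return means[1:]

batch_losses = [2.0, 1.0, 3.0, 2.0]  # made-up per-batch losses
means = running_mean(batch_losses)
print(means)
```

The last entry equals the plain average of all the batch losses seen so far, which is why the plotted curve is smoother than the individual batch losses would be.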
Step 8: Training on LSTM Model
We will also look at how to implement an LSTM model for classifying the nationality of a person's name. To do that, we will make use of PyTorch and create a custom LSTM class.
class LSTM_net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTM_net, self).__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = nn.LSTM(input_size, hidden_size)  # LSTM cell
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim = 2)

    def forward(self, input_, hidden):
        # hidden is a tuple (hidden state, cell state)
        out, hidden = self.lstm_cell(input_.view(1, 1, -1), hidden)
        output = self.h2o(hidden[0])  # use the hidden state, not the cell state
        output = self.softmax(output)
        return output.view(1, -1), hidden

    def init_hidden(self):
        return (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

n_hidden = 128
net = LSTM_net(n_letters, n_hidden, n_languages)
train_setup(net, lr = 0.0005, n_batches = 100, batch_size = 256)
After training the model for 100 batches, we are able to achieve a top-1 accuracy of 52.6% and a top-2 accuracy of 66.9% with the LSTM Model.
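One detail that is easy to miss is that nn.LSTM returns both an output and a tuple holding the hidden state and the cell state. The sketch below checks the shapes for a single character, assuming PyTorch is installed and the sizes used in this tutorial. The weights are random, so only the shapes and the relationship between the tensors matter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_letters, n_hidden = 55, 128  # sizes assumed from the tutorial

lstm = nn.LSTM(n_letters, n_hidden)
x = torch.zeros(1, 1, n_letters)  # (seq_len=1, batch=1, input_size)
x[0][0][26] = 1                   # one-hot row for a single character
state = (torch.zeros(1, 1, n_hidden), torch.zeros(1, 1, n_hidden))

out, (h, c) = lstm(x, state)
print(out.shape, h.shape, c.shape)
print(torch.allclose(out, h))     # single step, single layer: out matches h
```

For a one-step sequence through a single-layer LSTM, the output and the hidden state hold the same values, so either can be passed to the final linear layer.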
Congratulations! You just learned how to build a nationality classification model using PyTorch. Hope you enjoyed it! 😇
Liked the tutorial? In any case, I would recommend you to have a look at the tutorials mentioned below:
- Classifying Clothing Images in Python – A complete guide
- Wine Classification using Python – Easily Explained
Thank you for taking your time out! Hope you learned something new!! 😄