Predict Nationality Based On Name In Python

Hey folks! In this tutorial, we will build an RNN and an LSTM model that predict a person’s nationality from the characters of their name.

Let’s begin by understanding the dataset we have.

Understanding the Dataset

The dataset is a text file in which each row contains a person’s name and the nationality associated with it, separated by a comma. It contains more than 20k names and 18 unique nationalities, such as Portuguese, Irish, and Spanish.

A snapshot of the data is shown below. You can download the dataset here.

Predict Nationality Using People’s Names In Python

Let’s get right into the code implementation. We’ll begin by importing the modules, and then the names and the nationalities dataset that we’ve chosen for this demonstration.

Step 1: Importing Modules

Before we start building any model, we need to import all the required libraries into our program.

from io import open
import os, string, random, time, math
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
import torch 
import torch.nn as nn
import torch.optim as optim
from IPython.display import clear_output

Step 2: Loading the Dataset

To load the dataset, we go through each row in the file and build a list of (name, nationality) tuples. Along the way we also collect the names in X, the matching nationalities in y, and the unique nationalities in languages, which makes the data easier to feed to the model in the later sections.

languages = []
data = []
X = []
y = []

with open("name2lang.txt", 'r') as f:
    #read the dataset
    for line in f:
        line = line.split(",")
        name = line[0].strip()
        lang = line[1].strip()
        if lang not in languages:
            languages.append(lang)
        X.append(name)
        y.append(lang)
        data.append((name, lang))

n_languages = len(languages)
print("Number of  Names: ", len(X))
print("Number of Languages: ",n_languages)
print("All Names: ", X)
print("All languages: ",languages)
print("Final Data: ", data)
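To see what the loading loop produces without needing the file, here is a minimal sketch that runs the same parsing logic on a few in-memory rows. The sample rows are made up for illustration and only assume the "name, nationality" layout described above.

```python
import io

# Hypothetical rows in the same "name, nationality" layout as name2lang.txt.
sample = io.StringIO(
    "Abreu, Portuguese\n"
    "Adam, Irish\n"
    "Abana, Spanish\n"
    "Albuquerque, Portuguese\n"
)

languages = []
data = []
for line in sample:
    # Split on the comma and strip surrounding whitespace and the newline.
    name, lang = (part.strip() for part in line.split(","))
    if lang not in languages:
        languages.append(lang)
    data.append((name, lang))

print(data)       # list of (name, nationality) tuples
print(languages)  # unique nationalities in order of first appearance
```

Each nationality appears only once in languages, no matter how many names share it.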

Step 3: Train-test Split

We will split the data into training and testing sets in an 80:20 ratio: 80% of the names go to training and the remaining 20% to testing. Passing stratify = y keeps the proportion of each nationality the same in both sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123, stratify = y)
print("Training Data: ", len(X_train))
print("Testing Data: ", len(X_test))

Training Data:  16040
Testing Data:  4010
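To illustrate what a stratified split means, here is a hand-rolled sketch in plain Python. It is not how scikit-learn implements train_test_split; the function name stratified_split and the toy data are assumptions made only for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(X, y, test_size=0.2, seed=123):
    """Illustration only: hold out test_size of each class separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for name, label in zip(X, y):
        by_label[label].append(name)

    train, test = [], []
    for label, names in by_label.items():
        names = names[:]          # copy so the grouped list is not mutated
        rng.shuffle(names)
        n_test = round(len(names) * test_size)
        test += [(n, label) for n in names[:n_test]]
        train += [(n, label) for n in names[n_test:]]
    return train, test

# Toy, imbalanced data: 80 "Irish" and 20 "Spanish" placeholder names.
X = [f"name{i}" for i in range(100)]
y = ["Irish"] * 80 + ["Spanish"] * 20

train, test = stratified_split(X, y)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(train_counts)
print(test_counts)
```

Both splits keep the 4:1 ratio between the two classes, which matters when some nationalities have far fewer names than others.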

Step 4: Encoding Data

The sequence model takes character encodings as input rather than raw text. As a result, we must encode each name at the character level.

Once we have an encoding for each character, we stack the character-level encodings to get the encoding for the whole name. This process is carried out for every name, and each nationality is encoded as well.

all_letters = string.ascii_letters + ".,;"
print(string.ascii_letters)
n_letters = len(all_letters)

def name_rep(name):
  rep = torch.zeros(len(name), 1, n_letters)
  for index, letter in enumerate(name):
    pos = all_letters.find(letter)
    if pos != -1:  # find() returns -1 for characters outside all_letters; skip those
      rep[index][0][pos] = 1
  return rep

The function name_rep above generates a one-hot encoding for a name. To begin, we create a tensor of zeros with shape (len(name), 1, n_letters): one row per character in the name, a batch dimension of 1, and one column per character in all_letters.

Following that, we loop over the characters of the name, find the index of each one in all_letters, and set that position to 1, leaving the remaining values at 0.
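As a quick check of the encoding, the sketch below restates name_rep (with a small guard for characters that are not in all_letters, added here for illustration) and inspects the tensor for the name "Adam".

```python
import string
import torch

all_letters = string.ascii_letters + ".,;"
n_letters = len(all_letters)

def name_rep(name):
    # One row per character, batch dimension of 1, one column per known letter.
    rep = torch.zeros(len(name), 1, n_letters)
    for index, letter in enumerate(name):
        pos = all_letters.find(letter)
        if pos != -1:  # guard: unknown characters stay all-zero
            rep[index][0][pos] = 1
    return rep

rep = name_rep("Adam")
positions = rep.argmax(dim=2).flatten().tolist()

print(rep.shape)                 # torch.Size([4, 1, 55])
print(rep.sum(dim=2).flatten())  # exactly one 1 per character
print(positions)                 # [26, 3, 0, 12]
```

Lowercase letters come first in string.ascii_letters, so "a" maps to index 0 and the capital "A" maps to index 26.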

def nat_rep(lang):
    return torch.tensor([languages.index(lang)], dtype = torch.long)

Encoding nationalities follows a much simpler logic than encoding names. To encode a nationality, we look up its position in our list of nationalities and use that index as the encoding.
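The index encoding is easy to reverse, which is how a predicted class index is turned back into a nationality name. Here is a minimal sketch with a placeholder list of three nationalities; the dictionary lang_to_index is an optional alternative to list.index and is not part of the tutorial code.

```python
# Placeholder subset; the tutorial builds this list from the dataset.
languages = ["Portuguese", "Irish", "Spanish"]

# Same result as languages.index(lang), but with a constant-time lookup.
lang_to_index = {lang: i for i, lang in enumerate(languages)}

encoded = lang_to_index["Irish"]  # nationality -> class index
decoded = languages[encoded]      # class index -> nationality

print(encoded, decoded)
```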

Step 5: Building the Neural Network Model

We will build the RNN model in PyTorch by defining a class that subclasses nn.Module.

The __init__ method (the constructor) sets up the network’s layers, which hold the weights and biases, and stores the hidden size.

class RNN_net(nn.Module):
    
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN_net, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim = 1)
    
    def forward(self, input_, hidden):
        combined = torch.cat((input_, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden
    
    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

The forward function first concatenates a character’s input representation with the current hidden state. The i2h layer turns this combined vector into the next hidden state, while the i2o layer followed by log-softmax turns it into log-probabilities over the nationalities.
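To make the tensor shapes in one forward step concrete, here is a small sketch that performs the same operations outside the class. It assumes the sizes used in this tutorial (55 letters, 128 hidden units, 18 nationalities) and uses freshly initialised, untrained layers.

```python
import torch
import torch.nn as nn

n_letters, n_hidden, n_languages = 55, 128, 18  # sizes assumed from this tutorial

i2h = nn.Linear(n_letters + n_hidden, n_hidden)
i2o = nn.Linear(n_letters + n_hidden, n_languages)
log_softmax = nn.LogSoftmax(dim=1)

x = torch.zeros(1, n_letters)      # one-hot vector for a single character
x[0][0] = 1
hidden = torch.zeros(1, n_hidden)  # initial hidden state

combined = torch.cat((x, hidden), 1)    # shape (1, 55 + 128)
new_hidden = i2h(combined)              # shape (1, 128)
output = log_softmax(i2o(combined))     # shape (1, 18), log-probabilities

print(combined.shape, new_hidden.shape, output.shape)
print(output.exp().sum().item())        # probabilities sum to roughly 1
```

Because the output is a log-probability, exponentiating it gives a proper distribution over the nationalities.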

def infer(net, name):
    net.eval()
    name_ohe = name_rep(name)
    hidden = net.init_hidden()
    for i in range(name_ohe.size()[0]):
        output, hidden = net(name_ohe[i], hidden)
    return output

n_hidden = 128
net = RNN_net(n_letters, n_hidden, n_languages)
output = infer(net, "Adam")
index = torch.argmax(output)
print(output, index)

The infer function takes the network instance and a person’s name as input arguments. Inside it, we set the network to evaluation mode and compute the one-hot representation of the name.

Following that, we initialize the hidden state with zeros of size hidden_size and loop over the characters of the name, feeding each character’s one-hot vector into the network together with the hidden state returned by the previous step.

Finally, we return the output of the last step, which holds the log-probabilities of each nationality. The index of the largest value is the predicted nationality.
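The output of infer can be turned into readable predictions by exponentiating the log-probabilities and taking the top entries. The sketch below uses made-up log-probabilities and a placeholder list of three nationalities instead of a trained network.

```python
import torch

languages = ["Portuguese", "Irish", "Spanish"]  # placeholder subset

# Made-up log-probabilities standing in for the output of infer(net, name).
output = torch.log(torch.tensor([[0.2, 0.7, 0.1]]))

probs = output.exp()                 # back to probabilities
top_probs, top_idx = probs.topk(2)   # two most likely classes

predictions = [
    (languages[i], round(p, 2))
    for i, p in zip(top_idx[0].tolist(), top_probs[0].tolist())
]
print(predictions)
```

The first entry is the top-1 prediction, and the pair together is what a top-2 accuracy measure would check against the true label.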

Step 6: Computing Accuracy of the RNN Model

Before moving on to training the model, let’s create a function to compute the accuracy of the model.

To do so, we create an evaluation function that takes the following inputs:

Network instance
The number of data points
The value of k (a prediction counts as correct if the true class is among the k highest-scoring classes)
X and y testing data

def dataloader(npoints, X_, y_):
    to_ret = []
    for i in range(npoints):
        index_ = np.random.randint(len(X_))
        name, lang = X_[index_], y_[index_]
        to_ret.append((name, lang, name_rep(name), nat_rep(lang)))

    return to_ret

def eval(net, n_points, k, X_, y_):
    data_ = dataloader(n_points, X_, y_)
    correct = 0

    for name, language, name_ohe, lang_rep in data_:
        output = infer(net, name)
        val, indices = output.topk(k)
        if lang_rep in indices:
            correct += 1
    accuracy = correct/n_points
    return accuracy

Inside the function we will be performing the following operations:

Sample n_points (name, nationality) pairs at random, with replacement, using the data loader.
Iterate over all the sampled names.
Invoke the model on each input and get the output.
Take the k highest-scoring classes as the predictions.
Count the samples whose true class is among those predictions.
Return the fraction of correctly predicted samples.
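The steps above boil down to a top-k accuracy computation. Here is a minimal sketch in plain Python on hand-made scores; the function name top_k_accuracy and the numbers are assumptions for illustration and do not come from the trained model.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    correct = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)

# Made-up scores for 4 samples over 3 classes.
scores = [
    [0.1, 0.7, 0.2],  # true class 1: correct at top-1
    [0.5, 0.3, 0.2],  # true class 1: correct only at top-2
    [0.6, 0.3, 0.1],  # true class 2: missed at top-1 and top-2
    [0.2, 0.2, 0.6],  # true class 2: correct at top-1
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)  # 0.5 0.75
```

Top-2 accuracy can never be lower than top-1 accuracy, since every top-1 hit is also a top-2 hit.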

Step 7: Training the RNN Model

To train the model, we write a train function that processes one batch of names and a train_setup function that calls it repeatedly and reports progress.

def train(net, opt, criterion, n_points):
    opt.zero_grad()
    total_loss = 0
    data_ = dataloader(n_points, X_train, y_train)
    for name, language, name_ohe, lang_rep in data_:
        hidden = net.init_hidden()
        for i in range(name_ohe.size()[0]):
            output, hidden = net(name_ohe[i], hidden)
        loss = criterion(output, lang_rep)
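        loss.backward()  # each name builds its own graph, so retain_graph is not needed
        total_loss += loss.item()  # accumulate a plain float instead of a tensor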
    opt.step()       
    return total_loss/n_points

def train_setup(net, lr = 0.01, n_batches = 100, batch_size = 10, momentum = 0.9, display_freq = 5):
    criterion = nn.NLLLoss()
    opt = optim.SGD(net.parameters(), lr = lr, momentum = momentum)
    loss_arr = np.zeros(n_batches + 1)
    for i in range(n_batches):
        loss_arr[i + 1] = (loss_arr[i]*i + train(net, opt, criterion, batch_size))/(i + 1)
        if i%display_freq == display_freq - 1:
            clear_output(wait = True)
            print("Iteration number ", i + 1, "Top-1 Accuracy:", round(eval(net, len(X_test), 1, X_test, y_test), 4), 'Top-2 Accuracy:', round(eval(net, len(X_test), 2, X_test, y_test), 4), 'Loss:', round(loss_arr[i + 1], 4))
            plt.figure()
            plt.plot(loss_arr[1:i + 2], "-*")
            plt.xlabel("Iteration")
            plt.ylabel("Loss")
            plt.show()
            print("\n\n")

n_hidden = 128
net = RNN_net(n_letters, n_hidden, n_languages)
train_setup(net, lr = 0.0005, n_batches = 100, batch_size = 256)

After training the model for 100 batches, we are able to achieve a top-1 accuracy of 66.5% and a top-2 accuracy of 79% with the RNN Model.
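The loss curve plotted during training is a running mean rather than the raw per-batch loss. The sketch below reproduces the update rule used for loss_arr in train_setup on a few made-up batch losses, using a plain Python list in place of the NumPy array.

```python
import math

batch_losses = [2.9, 2.7, 2.8, 2.4]  # made-up per-batch losses

# loss_arr[i + 1] holds the mean of the first i + 1 batch losses.
loss_arr = [0.0] * (len(batch_losses) + 1)
for i, loss in enumerate(batch_losses):
    loss_arr[i + 1] = (loss_arr[i] * i + loss) / (i + 1)

print(loss_arr[1:])
print(sum(batch_losses) / len(batch_losses))  # matches the last running mean
```

Averaging this way smooths the curve, at the cost of reacting slowly to recent changes in the loss.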

Step 8: Training on LSTM Model

We will also look at how to implement an LSTM model for classifying the nationality of a person’s name. To do so, we again use PyTorch and create a custom LSTM class.

class LSTM_net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTM_net, self).__init__()
        self.hidden_size = hidden_size
        self.lstm_cell = nn.LSTM(input_size, hidden_size) #LSTM cell
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim = 2)

    def forward(self, input_, hidden):
        out, hidden = self.lstm_cell(input_.view(1, 1, -1), hidden)
        output = self.h2o(hidden[0])
        output = self.softmax(output)
        return output.view(1, -1), hidden

    def init_hidden(self):
        return (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

n_hidden = 128
net = LSTM_net(n_letters, n_hidden, n_languages)
train_setup(net, lr = 0.0005, n_batches = 100, batch_size = 256)

After training the model for 100 batches, we are able to achieve a top-1 accuracy of 52.6% and a top-2 accuracy of 66.9% with the LSTM Model.
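For readers who want to check the shapes that LSTM_net relies on, here is a small sketch of a single nn.LSTM step on one character. It assumes the same sizes as above (55 letters, 128 hidden units) and an untrained layer.

```python
import torch
import torch.nn as nn

n_letters, n_hidden = 55, 128  # sizes assumed from this tutorial
lstm = nn.LSTM(n_letters, n_hidden)

x = torch.zeros(1, n_letters)  # one-hot vector for a single character
x[0][0] = 1

# The hidden state is a pair: (hidden state h, cell state c).
hidden = (torch.zeros(1, 1, n_hidden), torch.zeros(1, 1, n_hidden))

# nn.LSTM expects input of shape (seq_len, batch, input_size).
out, (h_n, c_n) = lstm(x.view(1, 1, -1), hidden)

print(out.shape, h_n.shape, c_n.shape)  # each is (1, 1, 128)
print(torch.allclose(out, h_n))         # True for a single step, single layer
```

With one layer and a sequence length of 1, the output equals the new hidden state, which is why the forward method can pass hidden[0] straight to the h2o layer.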