LSTM Using PyTorch – A Simple Guide

LSTMs

The human brain works in an amazing way. We remember those embarrassing moments and all those subjects we crammed during our school and college days. What if machines could also remember what we tell them (to some extent)?

With the help of deep learning, we can already build models that process our input and remember it for the next task. These models are called neural networks, and one example of a memory-based neural network is the Recurrent Neural Network (RNN).

RNNs work well, but they are forgetful: over longer sequences they suffer from a kind of short-term memory loss. An extension of these networks, called Long Short-Term Memory (LSTM), solves this memory problem.

Let us dive into the concept of LSTMs.

What Are Recurrent Neural Networks?

RNNs are mainly used in applications like next-word prediction, where the previous input is very important for processing the next word. The architecture works like this: the output of the previous step is fed as an input to the next step. This is what differentiates RNNs from traditional neural networks, which treat inputs and outputs independently. It also means RNNs work well with sequential data.
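
To make this concrete, here is a hand-rolled sketch of the recurrence a plain RNN computes, using toy dimensions chosen purely for illustration (PyTorch's built-in nn.RNN does the same thing with learned weights): at every step, the new hidden state depends on the current input and the previous hidden state.

import torch

# Toy dimensions, for illustration only
input_size, hidden_size = 4, 3
W_xh = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The core recurrence: the previous hidden state is fed back in
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = torch.zeros(hidden_size)
sequence = [torch.randn(input_size) for _ in range(5)]  # a 5-step input sequence
for x_t in sequence:
    h = rnn_step(x_t, h)  # h now carries information from all earlier steps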

There are, however, a few problems with RNNs: the vanishing gradient problem and the exploding gradient problem. LSTMs were designed to address these issues, particularly the vanishing gradient.

LSTM Architecture

I’ll break down the architecture of LSTM in the simplest manner possible.

LSTM Architecture

LSTMs contain a cell state (ct), which acts as the memory of the network. Information can be added to, removed from, or modified in the cell state through the gates described below.

Then we have the hidden state (ht), which stores the output of the cell at a particular time step and is passed on to the next step.

Apart from cells, the LSTM also has gates to perform modifications on the information stored in the cells.

These are the input gate, forget gate, and output gate.

As their names suggest, the input gate controls what new information enters the cell state, and the output gate controls how much of the cell state is exposed as the hidden state. The forget gate is the main character, since it is what solves the short-term memory loss of RNNs: it determines how much of the previous information should be retained and how much should be forgotten.
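
To make the gate intuition concrete, here is a rough sketch of what a single LSTM cell step computes. This is not PyTorch's actual implementation, and the W, U, and b dictionaries of per-gate weights are assumptions made purely for illustration.

import torch

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are assumed to hold one weight matrix / bias vector per gate:
    # input (i), forget (f), output (o), and the candidate cell update (g)
    i = torch.sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    f = torch.sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    o = torch.sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g = torch.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])     # candidate values
    c_t = f * c_prev + i * g      # forget part of the old cell state, add new information
    h_t = o * torch.tanh(c_t)     # the output gate decides what the cell exposes
    return h_t, c_t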

Related: Deep Learning with PyTorch

LSTM With PyTorch

PyTorch is a dedicated library for building and working with deep learning models, and it ships with a built-in LSTM class. Its syntax is given below.

torch.nn.LSTM(*args, **kwargs)

The important parameters of the class are listed below; a minimal usage sketch follows the list.

  • input_size – the number of features in each element of the input sequence
  • hidden_size – the number of features in the hidden state h
  • num_layers – the number of stacked recurrent layers in the model
  • bias – defaults to True; if True, the layers use bias weights
  • batch_first – if True, the input and output tensors are shaped (batch, seq, feature) instead of (seq, batch, feature)
  • dropout – the dropout probability applied to the outputs of every LSTM layer except the last (default 0, i.e., no dropout)
  • bidirectional – if True, creates a bidirectional LSTM
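
Here is a minimal sketch of how these parameters fit together; the sizes below are chosen purely for illustration.

import torch
import torch.nn as nn

# An LSTM over sequences of 10-dimensional feature vectors,
# with a 20-dimensional hidden state and 2 stacked layers
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)

x = torch.randn(3, 7, 10)       # (batch=3, seq_len=7, features=10) because batch_first=True
output, (h_n, c_n) = lstm(x)
print(output.shape)             # torch.Size([3, 7, 20]) – the hidden state at every time step
print(h_n.shape, c_n.shape)     # torch.Size([2, 3, 20]) each – the final states for both layers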

Suggested Read: Predict the nationality based on the name using RNN and LSTM

An Application of LSTM – POS Tagging

POS tagging is one of the most popular applications of natural language processing (NLP). POS stands for Parts of Speech, so POS tagging means labelling each word with its grammatical role. English words fall into a small set of such categories, for example nouns, determiners, verbs, adjectives, and so on.

We are going to train a model on tagged data and then provide an input to see how well the LSTM model predicts the parts of speech of each word in a sentence.

We will go through the code step by step, so stick through till the end!

Importing the Necessary Libraries

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Pandas and NumPy are used for data manipulation and calculations. The PyTorch library is imported as torch, its neural network module as nn, and its functional module (which contains a collection of activation functions) as F. The optim module provides the optimizers used to train the model.

Importing the NLTK Library and Data

We are going to import the nltk library and download the treebank corpus along with the universal tagset.

import nltk 
nltk.download('treebank')
nltk.download('universal_tagset')

The treebank corpus is a sample of the Penn Treebank: a collection of English sentences whose words are already tagged. The universal tagset maps the detailed Penn Treebank tags onto a small set of coarse, universal POS tags.

tag_sentence = nltk.corpus.treebank.tagged_sents(tagset='universal')
print("The number of tagged sentences:", len(tag_sentence))

In this snippet, we are loading the tagged sentences into a variable called tag_sentence and printing the number of sentences in the next line.

Loading and Printing the sentences
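
If you want to see what the data actually looks like, you can peek at the first tagged sentence, which is simply a list of (word, tag) tuples.

# Show the first few (word, tag) pairs of the first tagged sentence
print(tag_sentence[0][:5])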

Now, we are going to create helper functions that convert a word, a character, or a tag into its index, plus one that converts an entire sequence at once.

def word_to_ix(word, ix):
    return torch.tensor(ix[word], dtype = torch.long)

def char_to_ix(char, ix):
    return torch.tensor(ix[char], dtype= torch.long)

def tag_to_ix(tag, ix):
    return torch.tensor(ix[tag], dtype= torch.long)

def sequence_to_idx(sequence, ix):
    return torch.tensor([ix[s] for s in sequence], dtype=torch.long)

Just as NumPy represents data as ndarrays, PyTorch represents it as tensors. All of these functions return tensors of the torch.long datatype, which is what PyTorch expects for indices.

In the following code, we loop through each sentence, then through every word and its corresponding POS tag, and store each new word, tag, and character in the dictionaries created in the first three lines.

word_to_idx = {}
tag_to_idx = {}
char_to_idx = {}
for sent in tag_sentence:
    for word, pos_tag in sent:
        if word not in word_to_idx.keys():
            word_to_idx[word] = len(word_to_idx)
        if pos_tag not in tag_to_idx.keys():
            tag_to_idx[pos_tag] = len(tag_to_idx)
        for char in word:
            if char not in char_to_idx.keys():
                char_to_idx[char] = len(char_to_idx)

Here is a sample of the word-to-index dictionary (word_to_idx) we built in the previous code snippet.

Word to Index

So, as the image above shows, every word in the corpus is assigned its own index.
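
For instance, once the dictionaries exist, converting a word into character indices, or a list of words into word indices, is a one-liner with sequence_to_idx (the exact index values depend on the corpus):

example_word = list(word_to_idx.keys())[0]           # any word from the vocabulary
print(sequence_to_idx(example_word, char_to_idx))    # tensor of character indices for that word
print(sequence_to_idx([example_word], word_to_idx))  # tensor containing that word's index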

Creating the Embeddings and Defining the Epochs

We are going to define separate embedding and hidden-state sizes for words and for characters. We are also going to define how many epochs the model trains for, i.e., how many times it passes over the training data.

WORD_EMBEDDING = 1024
CHAR_EMBEDDING = 128
WORD_HIDDEN = 1024
CHAR_HIDDEN = 1024
EPOCHS = 5
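
The model defined below also expects the vocabulary sizes which, as far as the rest of the code is concerned, are simply the lengths of the index dictionaries we built earlier:

word_vocab_size = len(word_to_idx)
char_vocab_size = len(char_to_idx)
tag_vocab_size = len(tag_to_idx)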

Now we are going to define the model class, which combines a character-level LSTM with a word-level LSTM and ends with a log-softmax activation.

class DualLSTMTagger(nn.Module):
    def __init__(self, word_embedding, word_hidden, char_embedding,\
            char_hidden, word_vocab_size, char_vocab_size, tag_vocab_size):
        super(DualLSTMTagger, self).__init__()
        self.word_embedding = nn.Embedding(word_vocab_size, word_embedding)
        
        self.char_embedding = nn.Embedding(char_vocab_size, char_embedding)
        self.char_lstm = nn.LSTM(char_embedding, char_hidden)
        
        self.lstm = nn.LSTM(word_embedding + char_hidden, word_hidden)
        self.hidden2tag = nn.Linear(word_hidden, tag_vocab_size)
        
    def forward(self, sentence, words):
        embeds = self.word_embedding(sentence)
        char_hidden_final = []
        for word in words:
            char_embeds = self.char_embedding(word)
            _, (char_hidden, char_cell_state) = self.char_lstm\
            (char_embeds.view(len(word), 1, -1))
            word_char_hidden_state = char_hidden.view(-1)
            char_hidden_final.append(word_char_hidden_state)
        char_hidden_final = torch.stack(tuple(char_hidden_final))
        
        combined = torch.cat((embeds, char_hidden_final), 1)

        lstm_out, _ = self.lstm(combined.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

In the __init__ method, we define the model’s architecture: embedding layers for words and characters, a character-level LSTM, a word-level LSTM that also consumes the character features, and a linear layer that maps the hidden state to tag scores.

The forward method returns a log-probability score for every tag of every word, computed with log_softmax (which pairs with the negative log-likelihood loss used below).

Next, we are going to define the model, the loss function, and the optimizer.

model = DualLSTMTagger(WORD_EMBEDDING, WORD_HIDDEN, CHAR_EMBEDDING,\
                       CHAR_HIDDEN, word_vocab_size, char_vocab_size,\
                       tag_vocab_size)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
if use_cuda:
    model.cuda()
loss_function = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

We have defined the model with the vocabulary sizes and the embedding and hidden dimensions for both words and characters. CUDA lets PyTorch run the model on an NVIDIA GPU; training a network like this on the CPU alone would be painfully slow, so if a GPU is available, the model is moved onto it.
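
As a side note, calling to(device) is an equivalent and slightly more general way to move the model, since it works whether device points at a CPU or a GPU.

model = model.to(device)  # does the same as model.cuda() when device is "cuda:0"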

The loss function (negative log-likelihood, which matches the model’s log-softmax output) and the Adam optimizer are defined in the last two lines.

Training the Model

Now, here comes the much-awaited model training!

print("Training Started")
accuracy_list = []
loss_list = []
interval = round(len(train) / 100.)
epochs = EPOCHS
e_interval = max(round(epochs / 10.), 1)
for epoch in range(epochs):
    acc = 0  # running accuracy for this epoch
    epoch_loss = 0  # running loss for this epoch
    i = 0
    for sentence_tag in train:
        i += 1
        # sequence_to_idx already returns tensors, so we only need to move them to the device
        words = [sequence_to_idx(s[0], char_to_idx).to(device) for s in sentence_tag]
        sentence = [s[0] for s in sentence_tag]
        sentence = sequence_to_idx(sentence, word_to_idx).to(device)
        targets = [s[1] for s in sentence_tag]
        targets = sequence_to_idx(targets, tag_to_idx).to(device)
        
        model.zero_grad()
        
        tag_scores = model(sentence, words)
        
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()  # accumulate the scalar loss, not the tensor
        _, indices = torch.max(tag_scores, 1)
        acc += torch.mean((targets == indices).float())
        if i % interval == 0:
            print("Epoch {} Running:\t{}% Complete".\
                  format(epoch + 1, i / interval), end = "\r", flush = True)
    epoch_loss = epoch_loss / len(train)
    acc = acc / len(train)
    loss_list.append(epoch_loss)
    accuracy_list.append(float(acc))
    if (epoch + 1) % e_interval == 0:
        print("Epoch {} Completed,\tLoss {}\tAccuracy: {}".\
              format(epoch + 1, np.mean(loss_list[-e_interval:]),\
                     np.mean(accuracy_list[-e_interval:])))

In the first few lines, we create empty lists to keep track of accuracy and loss. The interval variable controls how often progress is printed within an epoch. The number of epochs is set to 5, which means the model passes over the training data five times.

For each epoch, we accumulate the loss and accuracy over all sentences. The model’s predictions are stored in tag_scores, the loss is back-propagated, and the optimizer updates the model’s weights.

Model Training and Epochs

New Test Data

Let us give a sentence as input and see what the model predicts.

seq = "everybody eat the food . I kept looking out the window , \
trying to find the one I was waiting for .".split()
print("Running a check on the model after training.\nSentences:\n{}".\
      format(" ".join(seq)))
with torch.no_grad():
    # For each word we need the indices of all of its characters, and for the
    # sentence the index of every word (sequence_to_idx already returns tensors)
    words = [sequence_to_idx(word, char_to_idx).to(device) for word in seq]
    sentence = sequence_to_idx(seq, word_to_idx).to(device)

    tag_scores = model(sentence, words)
    _, indices = torch.max(tag_scores, 1)
    res = []
    for i in range(len(indices)):
        for key, value in tag_to_idx.items():
            if indices[i] == value:
                res.append((seq[i], key))
    print(res)

The sentence we wish to tag is stored in a variable called seq. Calling model(sentence, words) runs the forward method we saw earlier to predict the tag scores.

torch.max is used to pick the highest-scoring tag for each word.

The (word, tag) pairs for the sentence are collected in a list called res.

POS Tagging

Let us see how close our model is to correctly tagging the sentence with the help of an online tool.

Online POS Tagger

Conclusion

To summarize, LSTMs are a type of recurrent neural network (RNN) with remarkable memory power. They have the ability to remember what they need to (a kind of selective memory), unlike plain RNNs, which tend to lose earlier information over long sequences because of the long-term dependency problem.

LSTMs work great with sequential data and are often used in applications like next-word prediction, named entity recognition, and other natural language processing (NLP) tasks.

In this tutorial, we have learned about LSTM networks, their architecture, and how they improve on RNNs.

We have also used LSTM with PyTorch to implement POS Tagging.

References

LSTM PyTorch Documentation

Understanding LSTM Networks

Treebank Example – Penn

Online POS Tagger