Latent Dirichlet Allocation (LDA) Algorithm in Python


Hello readers, in this article we will try to understand what the LDA algorithm is, how it works, and how it is implemented in Python. Latent Dirichlet Allocation is an algorithm that primarily comes under the natural language processing (NLP) domain.

It is used for topic modelling. Topic modelling is a machine learning technique performed on text data to analyze it and find abstract topics shared across a collection of documents.

Also read: Depth First Iterative Deepening (DFID) Algorithm in Python

What is LDA?

LDA is one of the topic modelling algorithms specially designed for text data. This technique considers each document as a mixture of the topics that the algorithm produces as its final result. Each topic is a probability distribution over the words that occur in the set of all documents present in the dataset.

Preprocessing the data yields an array of keywords or tokens for each document. The LDA algorithm takes this preprocessed data as input and tries to find hidden/underlying topics based on the probability distribution of these keywords. Initially, the algorithm assigns each word in each document to a random topic out of the 'n' pre-defined topics.

For example, consider the following text data:

  • Text 1: Excited for IPL, this year let’s go back to cricket stadiums and enjoy the game.
  • Text 2: We might face the 4th wave of Covid this August!
  • Text 3: Get vaccinated as soon as possible, it’s high time now.
  • Text 4: The Union Budget has increased its quota for sports this year, all thanks to the Olympics winners this year.

Theoretically, let's consider two topics, Sports and Covid, for the algorithm to work on. The algorithm may assign the first word, "IPL", to topic 2 (Covid). We know this assignment is wrong, but the algorithm will try to correct it in future iterations based on two factors: how often the topic occurs in the document and how often the word occurs in the topic. As there are not many Covid-related terms in text 1 and the word "IPL" will not occur many times in topic 2 (Covid), the algorithm may reassign the word "IPL" to topic 1 (Sports). With multiple such iterations, the algorithm achieves stability in topic recognition and word distribution across the topics. Finally, each document can be represented as a mixture of the determined topics.

Also read: Bidirectional Search in Python

How does LDA work?

The following steps are carried out in LDA to assign topics to each of the documents:

1) For each document, randomly initialize each word to a topic amongst the K topics where K is the number of pre-defined topics.

2)  For each document d:

For each word w in the document, compute:

  • P(topic t| document d): Proportion of words in document d that are assigned to topic t
  • P(word w | topic t): Proportion of assignments to topic t, across all documents, that come from word w

3) Reassign word w to topic t' with probability P(t'|d) * P(w|t'), considering all other words and their topic assignments

The last step is repeated multiple times until we reach a steady state where the topic assignments no longer change. The proportion of topics for each document is then determined from these final topic assignments (a minimal sketch of this loop is given below).
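
To make steps 1-3 concrete, here is a minimal, illustrative Gibbs-sampling-style loop in plain Python. The toy corpus, the smoothing constant 0.1, and all variable names are our own choices for this sketch, not part of any library:

import random
from collections import defaultdict

# Toy tokenized corpus (stopwords already removed)
docs = [["ipl", "cricket", "stadium", "game"],
        ["wave", "covid", "august"],
        ["vaccinated", "covid", "time"],
        ["budget", "sports", "olympics", "winners"]]
K = 2   # number of pre-defined topics

# Step 1: randomly assign every word to one of the K topics
assignments = [[random.randrange(K) for _ in doc] for doc in docs]

def recount():
    # tally words per (document, topic) and per (topic, word)
    doc_topic = [[0] * K for _ in docs]
    topic_word = [defaultdict(int) for _ in range(K)]
    topic_total = [0] * K
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    return doc_topic, topic_word, topic_total

# Steps 2-3: repeatedly reassign each word in proportion to
# P(topic t | document d) * P(word w | topic t)
# (a full implementation would exclude the current word's own count)
for _ in range(50):
    doc_topic, topic_word, topic_total = recount()
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            weights = [(doc_topic[d][t] + 0.1) / (len(doc) + K * 0.1)
                       * (topic_word[t][w] + 0.1) / (topic_total[t] + 0.1)
                       for t in range(K)]
            assignments[d][i] = random.choices(range(K), weights=weights)[0]

print(assignments)   # stabilized topic assignment for every word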

Illustrative Example of LDA:

Let us say that we have the following 4 documents as the corpus and we wish to carry out topic modelling on these documents.

  • Document 1: We watch a lot of videos on YouTube.
  • Document 2: YouTube videos are very informative.
  • Document 3: Reading a technical blog makes me understand things easily.
  • Document 4: I prefer blogs to YouTube videos.

LDA modelling helps us discover topics in the above corpus and assign topic mixtures to each of the documents. As an example, the model might output something like the following:

Topic 1: 40% videos, 60% YouTube

Topic 2: 95% blogs, 5% YouTube

Documents 1 and 2 would then belong 100% to Topic 1. Document 3 would belong 100% to Topic 2. Document 4 would belong 80% to Topic 2 and 20% to Topic 1.
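
As a rough sketch, the same toy corpus can be pushed through gensim; with only four tiny documents the split is noisy and the exact numbers will vary from run to run:

from gensim import corpora, models

docs = [["watch", "videos", "youtube"],
        ["youtube", "videos", "informative"],
        ["reading", "technical", "blog", "understand"],
        ["prefer", "blogs", "youtube", "videos"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=20)

for d, vec in enumerate(bow):
    print("Document", d + 1, lda.get_document_topics(vec))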

How to implement LDA in Python?

Following are the steps to implement the LDA algorithm:

  1. Collecting data and providing it as input
  2. Preprocessing the data (removing the unnecessary data)
  3. Modifying data for LDA Analysis
  4. Building and training LDA Model
  5. Analyzing LDA model results

Here, we have input data collected from Twitter and converted into a CSV file; data from social media is varied, which makes it a good corpus for building a useful model.
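
As a sketch, the CSV can be loaded with pandas; the file name tweets.csv and the column name text below are placeholders for whatever your export actually contains:

import pandas as pd

# Hypothetical file and column names - adjust to your own dataset
tweets_df = pd.read_csv('tweets.csv')
raw_tweets = tweets_df['text'].astype(str).tolist()
print(raw_tweets[:3])   # peek at the first few tweets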

Importing required libraries for LDA

import numpy as np
import pandas as pd 
import re

import gensim
from gensim import corpora, models
from nltk.corpus import stopwords

Cleaning Data

Normalizing Whitespace

def normalize_whitespace(tweet):
    # collapse runs of whitespace into a single space and trim both ends
    tweet = re.sub(r'\s+', ' ', tweet).strip()
    return tweet

text = "         We        are the students    of    Science. "
print("Text Before: ",text)
text = normalize_whitespace(text)
print("Text After: ",text)

OUTPUT:

Text Before:           We        are the students    of    Science. 

Text After:  We are the students of Science.

Removing stopwords

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')


def remove_stopwords(text):
    final_s = ""
    text_arr = text.split(" ")                # split the sentence on spaces
    for word in text_arr:
        if word not in stop_words:            # keep the word only if it is not a stopword
            final_s = final_s + word + " "

    return final_s
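
For example (the exact result depends on NLTK's English stopword list, and the comparison is case-sensitive, so the capitalised "We" survives):

print(remove_stopwords("We are the students of Science"))
# Output: We students Science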

Stemming and Tokenisation

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize_stemming(text):
    text = re.sub(r'[^\w\s]','',text)
    #replace multiple spaces with one space
    text = re.sub(r'[\s]+',' ',text)
    #transfer text to lowercase
    text = text.lower() 
    # tokenize text
    tokens = re.split(" ", text)

    # Remove stop words 
    result = []
    for token in tokens:
        if token not in stop_words and len(token) > 1:
            result.append(stemmer.stem(token))

    return result
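
Before computing term frequencies, the cleaned tweets must be tokenized and mapped into a gensim dictionary. A minimal bridge, assuming the raw_tweets list from the loading sketch above, could look like this:

# Apply the preprocessing pipeline to every raw tweet
# (tokenize_stemming already lowercases, removes stopwords and stems)
tokens = [tokenize_stemming(tweet) for tweet in raw_tweets]

# Map each unique token to an integer id
dictionary = corpora.Dictionary(tokens)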

Also read: Tokenization in Python using NLTK

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor.
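
For a term t in a document d, the weight is commonly computed as tfidf(t, d) = tf(t, d) * log(N / df(t)), where tf(t, d) is the number of times t occurs in d, N is the total number of documents, and df(t) is the number of documents containing t (gensim's TfidfModel uses a base-2 logarithm by default).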

corpus_doc2bow_vectors = [dictionary.doc2bow(tok_doc) for tok_doc in tokens]
print("# Term Frequency : ")
print(corpus_doc2bow_vectors[:5])

tfidf_model = models.TfidfModel(corpus_doc2bow_vectors, id2word=dictionary, normalize=False)
corpus_tfidf_vectors = tfidf_model[corpus_doc2bow_vectors]

print("\n# TF_IDF: ")
print(corpus_tfidf_vectors[5])
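
Each printed vector is a list of (token_id, weight) pairs; dictionary[token_id] maps an id back to its token, and tokens that appear in many tweets receive lower TF-IDF weights.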

Also read: Creating a TF-IDF Model from Scratch in Python

Running LDA using Bag of Words

lda_model = gensim.models.LdaMulticore(corpus_doc2bow_vectors, num_topics=10, id2word=dictionary, passes=2, workers=2)
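
To inspect what the bag-of-words model has learned, its topics can be printed exactly as we do for the TF-IDF model below:

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} Words: {}'.format(idx, topic))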

Also read: Creating Bag of Words Model from Scratch in python

Running LDA using TF-IDF

lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf_vectors, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Classification of the topics

Performance evaluation by classifying sample documents using the LDA bag-of-words model. We will check where our test document would be classified.

for index, score in sorted(lda_model[corpus_doc2bow_vectors[1]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))

Also read: Classify News Headlines in Python – Machine Learning

Performance evaluation by classifying sample documents using the LDA TF-IDF model.

for index, score in sorted(lda_model_tfidf[corpus_tfidf_vectors[1]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))

Conclusion

In this article, we tried to understand one of the most commonly used algorithms in the natural language processing domain. LDA is the foundation of topic modelling, a type of statistical modelling and data mining.
