Topic Modeling using LSA: A Complete Guide

Developing a seamless and interactive interface between humans and machines will always be a key concern for today’s and tomorrow’s increasingly intelligent applications.

Natural language processing (NLP) is an advancing technology that drives many of the forms of AI we see today.

Within NLP, topic modeling is one of the most important techniques. In this article, let’s try to understand what topic modeling is and how to implement it in Python.

What is Topic Modeling?

NLP’s topic modeling technique is used to infer from the text of a set of documents what they are about. It works by scanning a collection of documents, identifying word and phrase patterns within them, and then automatically clustering word groups and related phrases that best describe a collection of documents. It discovers hidden patterns or data groupings.

In this article, let’s try to implement topic modeling using the Latent Semantic Analysis (LSA) algorithm. But before we start the implementation, let’s understand the concept of LSA.

Topic modeling can also be implemented using Latent Dirichlet Allocation (LDA). To learn more about it, read Latent Dirichlet Allocation (LDA) Algorithm in Python.

What is LSA?

Latent semantic analysis (LSA) is a natural language processing technique, rooted in distributional semantics, that analyzes the relationships between a collection of documents and the terms they contain by producing a set of concepts related to both the documents and the terms. It is commonly used for data clustering and information retrieval in text analysis.

Concept searching and automatic document classification are two other main uses of LSA. It is an unsupervised learning approach, which means there is no particular target to predict and no labels or tags are assigned. The goal of latent semantic analysis is to produce representations of the text data in terms of topics, or latent features.

The word latent itself means hidden. We are looking at anything that is latent (hidden) or inherent to the data itself. We will be able to decrease the dimensionality of the original text-based data collection as a byproduct of this.

How does LSA work?

Latent Semantic Analysis primarily involves four steps. The second and third are the more crucial and complex ones. The steps are:

Collect raw text data
Generate a document-term matrix
Perform singular value decomposition (SVD)
Examine the topic-encoded data

Collecting Raw Text Data

The data used for topic modeling is always in text format. The input is generally a collection of documents. To illustrate the concept, we will use the small example below. In actual projects, data is usually scraped from various open sources like social media, reports, etc.

Document 1: The smallest country in the world.
Document 2: The largest country in the world.
Document 3: The largest city in my country.
Document 4: The smallest city in the world.

Document Term Matrix

A document-term matrix is a mathematical matrix that indicates the frequency of terms that appear in a set of documents. In a document-term matrix, columns represent terms in the collection and rows represent documents in the collection. This matrix is an example of a document-feature matrix, where “features” can relate to more than just terms in a document.

Another frequent format is the transposition, or term-document matrix, where terms are the rows and documents are the columns. The fundamental concept behind a document term matrix is that text documents can be represented as points in Euclidean space, also known as vectors.
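To make the layout concrete, here is a minimal sketch that builds the document-term matrix for the four example documents by hand, using only the standard library. The tokenizer (lowercase, letters only) and the variable names are assumptions made for illustration, not part of any library.

```python
import re
from collections import Counter

# The four example documents from the section above.
corpus = [
    "The smallest country in the world.",
    "The largest country in the world.",
    "The largest city in my country.",
    "The smallest city in the world.",
]

# Simplistic tokenizer (an assumption): lowercase, keep runs of letters only.
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]

# The vocabulary is the sorted set of all terms; each term becomes one column.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Document-term matrix: one row per document, one count per vocabulary term.
counts = [Counter(tokens) for tokens in tokenized]
doc_term = [[c[term] for term in vocabulary] for c in counts]

# Term-document matrix: the transpose (terms as rows, documents as columns).
term_doc = [list(column) for column in zip(*doc_term)]

print(vocabulary)
for row in doc_term:
    print(row)
```

Each row of doc_term is the vector that represents one document as a point in an eight-dimensional space, one dimension per term.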

Let’s try to understand this concept with an example.

Here, using Scikit-Learn, we define the four documents we’ve been analyzing as four strings in a list. The CountVectorizer model can be used to generate the document-term matrix; it is imported from Scikit-Learn’s sklearn.feature_extraction.text module.

After creating the count vectorizer, we fit it and then transform our corpus into a bag of words. In natural language processing, “bag of words” refers to the most basic application of a document-term matrix.
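The step described above might look like the following sketch. The variable names corpus, vectorizer and bag_of_words are chosen here for illustration, and the vectorizer is left at its default settings, which is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The four example documents, as four strings in a list.
corpus = [
    "The smallest country in the world.",
    "The largest country in the world.",
    "The largest city in my country.",
    "The smallest city in the world.",
]

# Default settings (an assumption): lowercasing, tokens of two or more characters.
vectorizer = CountVectorizer()

# Learn the vocabulary and build the document-term matrix in one call.
bag_of_words = vectorizer.fit_transform(corpus)

# The result is a sparse matrix: rows are documents, columns are terms.
print(bag_of_words.shape)
print(bag_of_words.toarray())
```

fit_transform returns a SciPy sparse matrix, so toarray() is only there to make the small example readable when printed.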

Singular value decomposition (SVD)

Singular value decomposition (SVD) is a matrix factorization method. In LSA, the text is first condensed into a matrix of word counts per document (here in the term-document orientation, where rows represent unique words and columns represent documents), and SVD is then used to reduce the number of rows while preserving the similarity structure among the columns.

Documents are then compared using the cosine of the angle between any two column vectors, which equals the dot product of the two vectors after each is normalized to unit length. Values near 1 reflect documents that are extremely similar, while values near 0 describe documents that are quite different.

Singular value decomposition is similar to principal component analysis, if you’re familiar with that statistical method. Encoding the original data set with these latent features using latent semantic analysis reduces its dimensionality. These latent features correspond to the topics of the original text data.

The next stage is to perform our singular value decomposition; this can be done using Scikit-Learn’s TruncatedSVD model. We import TruncatedSVD from sklearn.decomposition and use it to fit and then transform the bag of words into our LSA representation. The word “truncated” refers to the fact that we won’t get back as many vectors as we started with.
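As a small illustration of the cosine comparison, the sketch below computes it for raw count vectors of the example documents. The function name and the hard-coded vectors are assumptions for illustration; in LSA the same comparison would normally be applied to the reduced vectors produced by the SVD.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Count vectors over the vocabulary
# [city, country, in, largest, my, smallest, the, world].
doc1 = [0, 1, 1, 0, 0, 1, 2, 1]  # The smallest country in the world
doc2 = [0, 1, 1, 1, 0, 0, 2, 1]  # The largest country in the world
doc3 = [1, 1, 1, 1, 1, 0, 1, 0]  # The largest city in my country

# Documents 1 and 2 share most of their words: 7 / 8 = 0.875,
# up to floating-point rounding.
print(cosine_similarity(doc1, doc2))

# Documents 1 and 3 share fewer words, so the value is lower.
print(cosine_similarity(doc1, doc3))
```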
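To show the dimensionality reduction itself, here is a sketch using NumPy’s general-purpose SVD on the count matrix of the four example documents (rows are documents, columns are terms). The matrix is hard-coded and the choice k = 2 is an assumption for illustration. Unlike principal component analysis, this does not center the data first.

```python
import numpy as np

# Document-term counts over the vocabulary
# [city, country, in, largest, my, smallest, the, world].
A = np.array(
    [
        [0, 1, 1, 0, 0, 1, 2, 1],
        [0, 1, 1, 1, 0, 0, 2, 1],
        [1, 1, 1, 1, 1, 0, 1, 0],
        [1, 0, 1, 0, 0, 1, 2, 1],
    ],
    dtype=float,
)

# Thin SVD: A = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest latent features.
k = 2
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Each document is now described by k numbers instead of 8 word counts.
topic_encoded = U_k * S_k

# Multiplying back gives the best rank-k approximation of A.
A_k = topic_encoded @ Vt_k

print(S)
print(topic_encoded.shape)
```

The rows of Vt_k play the role of the topics: projecting the original counts onto them (A @ Vt_k.T) gives the same topic-encoded values.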
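A sketch of this step with Scikit-Learn follows. The variable names and random_state=0 are assumptions added for illustration and reproducibility; n_components=2 follows the two topics used in this article.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The smallest country in the world.",
    "The largest country in the world.",
    "The largest city in my country.",
    "The smallest city in the world.",
]

# Bag of words: rows are documents, columns are terms.
bag_of_words = CountVectorizer().fit_transform(corpus)

# Request two latent features (topics); random_state fixes the
# randomized solver so repeated runs give the same numbers.
svd = TruncatedSVD(n_components=2, random_state=0)

# Fit the decomposition and encode each document in topic space.
lsa = svd.fit_transform(bag_of_words)

print(lsa.shape)
print(lsa)
```

TruncatedSVD accepts the sparse matrix directly, so there is no need to convert the bag of words to a dense array first.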

Topic Encoded Data

This step aims to transform our original data into topic-encoded data. The data should now consist of two columns, one for each of the two topics that we requested from the truncated SVD; recall that this value of two was passed as an argument to the truncated SVD.

To view the results of our LSA, we’ll use the pandas library. The result shows the four original documents that we started with along with a numerical value for each of the two topics. The first and fourth documents are about the smallest places, whereas the second and third documents are about the largest places.

Note that all four documents are strong in topic one, but for topic two there is a clear distinction between the second and third documents on one side and the first and fourth on the other.
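One way to view the topic-encoded data with pandas is sketched below. The column names topic_1, topic_2 and document are hypothetical labels chosen here. The overall sign of a topic is arbitrary, so what matters is which documents fall on the same side, not whether a value is positive or negative.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The smallest country in the world.",
    "The largest country in the world.",
    "The largest city in my country.",
    "The smallest city in the world.",
]

bag_of_words = CountVectorizer().fit_transform(corpus)
svd = TruncatedSVD(n_components=2, random_state=0)
lsa = svd.fit_transform(bag_of_words)

# One row per document, one column per topic.
topic_encoded_df = pd.DataFrame(lsa, columns=["topic_1", "topic_2"])
topic_encoded_df["document"] = corpus

print(topic_encoded_df[["document", "topic_1", "topic_2"]])
```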

Byproducts of LSA

One great thing about latent semantic analysis is that it generates a few byproducts that can help us understand what each topic is encoding. The two byproducts that we’re going to look at are:

The dictionary – the set of all words that show up in at least one document in the corpus.

Encoding matrix – the matrix that was used to encode the documents into the topic-encoded representation. It can be examined to get a greater understanding of what each topic represents.

Let’s take a look at the dictionary first. It is an attribute of the fitted count vectorizer model and can be accessed using its get_feature_names method (get_feature_names_out in recent Scikit-Learn versions). We can then examine the encoding matrix to gain an understanding of the topics latent in the corpus. In this encoding matrix, each row represents a word in our dictionary and each column one of our two topics; the numerical values can be thought of as the expression of that word in a given topic.

Next, we’ll interpret the encoding matrix. We might be interested in the top words for each topic, or in which dimensions in word space explain most of the variance in the data. Note that we need to look at the absolute value of the expression of each word in the topic: a word that has a strong negative representation is just as important as a word that has a strong positive representation when we interpret the topics.

Let’s take a look at topic one. The most important word is the word “the”, though such words, which carry little meaning, can be removed (stop-word removal), a step not covered here. Now let’s take a look at topic two. The two most important words are “largest” and “smallest”, but let’s go back to the original signed values.

We can see that “largest” is very strongly positive while “smallest” is very strongly negative. This tells us that topic two is a good topic for representing whether a document is about the largest or the smallest.