Text Mining in Python - A Complete Guide

Today, we are going to learn about text mining in Python, along with some of the modules and methods it relies on. Before going deeper, here is what this guide covers: what text mining is, its advantages and applications, and a step-by-step implementation.

What is Text Mining in Python?

Before getting started, let's understand what text mining really is.

Text mining is the process of extracting information from text data. It covers tasks such as text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and context-related modeling. To find links and associations in text, it draws on information retrieval, lexical analysis, word-frequency analysis, and pattern recognition, along with visualization, natural language processing, and other analytical methods.

Advantages of Text Mining

It structures unstructured data for further analysis.
It reduces the risk of false insights.
It supports informed decisions, automated processes, and research analysis.

Applications of Text Mining

Information Extraction: Identifying and extracting relevant facts from unstructured data.
Document classification and clustering: Grouping and categorizing documents or terms using classification and clustering methods.
Information Retrieval: Finding the documents or passages in a collection that are relevant to a query.
Natural Language Processing: A major part of text mining, in which computational techniques are used to understand and analyze unstructured text.

Implementing Text Mining in Python

We will now work through a simple text mining problem: comparing word frequencies in two text files. We are going to use Google Colab and move step by step.

Step 1: Importing modules

We need to import some modules in order to do our work.

codecs module: It provides encoders and decoders for text encodings, along with error handling for the codec lookup process. We use it to read our files.
collections module: It provides specialized container types as alternatives to the built-in dict, list, tuple, and set. We use its Counter class.
NLTK (Natural Language Toolkit): It is a library used for natural language processing.
Matplotlib library: It is used for visualizing data, for example as a graph or chart.
NumPy library: It is used for working with arrays and provides built-in methods for array operations.
Pandas library: It is used for data manipulation and analysis in the Python programming language.
nltk.stem: It is a sub-package of NLTK that removes morphological affixes from words, leaving only the stem. We import the Porter stemmer algorithm from this sub-package.
nltk.tokenize: It helps us split text into tokens.

We will use the tokenizer inside a function called total_tokens() to count the tokens in our text files. The Porter stemmer and Matplotlib are imported here as well, although the code in this guide does not call them.

import codecs
import collections

import numpy as np
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
import matplotlib

If any of these imports fails, install the missing packages with pip. In a notebook such as Google Colab, prefix the command with an exclamation mark as shown below; in a terminal or command prompt, drop the exclamation mark.

!pip install nltk
!pip install pandas
!pip install numpy
!pip install matplotlib

Step 2: Reading text file

The codecs module provides the codecs.open() method for reading and writing Unicode-encoded files. This is useful for files that include characters from many different languages.

with codecs.open("/content/text1.txt", "r", encoding="utf-8") as f:
  text1=f.read()
with codecs.open("/content/text2.txt", "r", encoding="utf-8") as f:
  text2=f.read()
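As a minimal sketch of the reading step, the snippet below writes a small sample file and reads it back, first with codecs.open() and then with the built-in open(), which accepts the same encoding argument. The file name sample.txt and its contents are made up for illustration; they are not the tutorial's text1.txt or text2.txt.

```python
import codecs
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is runnable.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(sample_path, "w", encoding="utf-8") as f:
    f.write("naïve café, 東京")

# Read it back the same way the tutorial reads its text files.
with codecs.open(sample_path, "r", encoding="utf-8") as f:
    sample_text = f.read()

# The built-in open() takes the same encoding argument.
with open(sample_path, "r", encoding="utf-8") as f:
    same_text = f.read()

print(sample_text == same_text)
```

Passing the encoding explicitly in both cases avoids depending on the platform's default encoding.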

Step 3: Creating Functions

We use the WordPunctTokenizer().tokenize() method to split the text into tokens (words and runs of punctuation). Counting those tokens gives us the total number of tokens in the file.

We then use collections.Counter to store each distinct token as a key and its count as the corresponding value. The function below returns this Counter together with the total number of tokens.

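def total_tokens(text):
 #splitting the text into word and punctuation tokens
 tokens = WordPunctTokenizer().tokenize(text)
 #returning the count of each token and the total number of tokens
 return collections.Counter(tokens), len(tokens)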
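To see what such a function returns, here is a minimal sketch that runs without NLTK. It assumes the regular expression \w+|[^\w\s]+, which is the pattern the NLTK documentation gives for WordPunctTokenizer; the name total_tokens_sketch and the sample sentence are made up for illustration.

```python
import collections
import re

def total_tokens_sketch(text):
    # Words, or runs of punctuation, become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return collections.Counter(tokens), len(tokens)

counter, size = total_tokens_sketch("Text mining is fun. Mining text is useful!")
print(counter.most_common(3))
print(size)
```

Note that punctuation marks count as tokens and that matching is case sensitive, so "Text" and "text" are counted separately.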

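We can create a function to calculate the absolute and relative frequency of the most common words in our text file and return the results as a data frame. The absolute frequency is the number of times a token occurs; the relative frequency is that count divided by the total number of tokens.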

#function to build a data frame of absolute and relative frequencies for the most common words
def make_df(counter, size):
 #getting the absolute and relative frequency of each (token, count) pair in the counter list
 absolute_frequency = np.array([el[1] for el in counter])
 relative_frequency = absolute_frequency / size

 #creating a data frame using obtained data above(absolute_frequency & relative_frequency)
 df = pd.DataFrame(data=np.array([absolute_frequency, relative_frequency]).T, index = [el[0] for el in counter], columns=["Absolute frequency", "Relative frequency"])
 df.index.name = "Most common words"
 
 return df
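The arithmetic inside make_df() can be traced on a tiny hand-made input. The sketch below uses plain Python instead of NumPy and pandas; the helper name frequency_rows and the sample pairs are assumptions for illustration only.

```python
def frequency_rows(pairs, size):
    """Return (token, absolute, relative) rows for a list of (token, count) pairs."""
    return [(token, count, count / size) for token, count in pairs]

# Same shape as the output of Counter.most_common(2).
pairs = [("is", 2), ("text", 1)]
rows = frequency_rows(pairs, size=10)

for token, absolute, relative in rows:
    print(f"{token}\t{absolute}\t{relative:.2f}")
```

With 10 tokens in total, a token that occurs twice has a relative frequency of 2 / 10 = 0.2.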

Step 4: Working on text

Now we are going to analyze each text using the two functions above. We will pass each text to the total_tokens() function. After getting the token counts and the total number of tokens, we will pass the most common tokens and the total to the make_df() function to get the resulting data frame.

#for text1
text1_counter, text1_size = total_tokens(text1)
make_df(text1_counter.most_common(10), text1_size)

The above code snippet uses the previous functions to create a data frame of the ten most common tokens and their frequencies. Let us do the same for text2.

#for text2
text2_counter, text2_size = total_tokens(text2)
make_df(text2_counter.most_common(10), text2_size)

The above code snippet displays a data frame of the ten most common tokens in text2, with their absolute and relative frequencies.

Let us find the most common words across the two documents and compute the difference in relative frequency for each of them.

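#adding two Counters sums the counts of each token across both texts
all_counter = text1_counter + text2_counter
#the size argument is set to 1 because only the index (the words themselves) is used below
all_df = make_df(all_counter.most_common(1000), 1)
x = all_df.index.values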

#creating a list df_data[] for our data frame, with one row per word comprising
#text1 relative frequency as text1_c,
#text2 relative frequency as text2_c, and
#relative frequency difference as difference
df_data = []
for word in x:
 #getting the relative frequency of each word in text1 and text2 and storing it in text1_c and text2_c respectively
 text1_c = text1_counter.get(word, 0) / text1_size
 text2_c = text2_counter.get(word, 0) / text2_size
 
 #calculating the difference between text1_c and text2_c and taking its absolute value (in case the difference is negative)
 difference = abs(text1_c - text2_c)

 #appending above three columns into the list
 df_data.append([text1_c, text2_c, difference])

#creating dataframe dist_df and loading above list into the same
dist_df = pd.DataFrame(data=df_data, index=x, columns=["text1 relative frequency", "text2 relative frequency","Relative frequency difference" ])
dist_df.index.name = "Most common words"
dist_df.sort_values("Relative frequency difference", ascending=False, inplace=True)

#printing our required result
dist_df.head(10)

The above code snippet displays the ten words whose relative frequencies differ most between the two texts.
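The comparison can also be traced by hand on two tiny counters. This is a minimal sketch in plain Python, not the tutorial's pandas code; the counters a and b and their contents are invented for illustration.

```python
import collections

a = collections.Counter({"data": 3, "text": 1})   # 4 tokens in total
b = collections.Counter({"data": 1, "word": 3})   # 4 tokens in total
a_size, b_size = sum(a.values()), sum(b.values())

# Adding two Counters sums the counts token by token.
combined = a + b

rows = []
for word, _ in combined.most_common():
    # .get(word, 0) handles words that occur in only one of the texts.
    a_rel = a.get(word, 0) / a_size
    b_rel = b.get(word, 0) / b_size
    rows.append((word, a_rel, b_rel, abs(a_rel - b_rel)))

# Largest difference first, like sort_values(..., ascending=False).
rows.sort(key=lambda row: row[3], reverse=True)
print(rows[0])
```

Here "word" comes first because it never occurs in a but makes up three quarters of b, which is the largest gap of the three words.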

Step 5: Save the file as CSV

dist_df.to_csv("output.csv")

The above line creates a file named output.csv and writes our data frame to it. In this way, we have saved our result.
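To check a saved file, it can be read back. The sketch below uses the standard csv module and, so that it runs on its own, first writes a stand-in file with the header layout that to_csv() produces for dist_df (the index name first, then the column names). The numbers in it are invented, not results from the tutorial.

```python
import csv
import os
import tempfile

# Stand-in for output.csv with invented numbers.
path = os.path.join(tempfile.mkdtemp(), "output.csv")
header = ["Most common words", "text1 relative frequency",
          "text2 relative frequency", "Relative frequency difference"]
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow(["word", 0.0, 0.75, 0.75])

# Each row comes back as a dict keyed by the header names.
with open(path, newline="", encoding="utf-8") as f:
    saved_rows = list(csv.DictReader(f))

print(saved_rows[0]["Most common words"])
```

Values read this way are strings, so convert them with float() before doing arithmetic on them.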

Conclusion

In this article, we covered a small part of text mining. We learned about some new modules and their methods, and then put them to use. We hope you enjoyed our tutorial today and look forward to learning more exciting topics in the future.