Today we are going to learn an exciting topic: text mining in Python. We will cover some important modules and methods along the way. Before going deeper, let's look at what text mining is, its advantages and applications, and then implement it step by step.
What is Text Mining in Python?
Before getting started, let's understand what text mining really is.
Text mining is the process of extracting information from text data. It involves a variety of tasks such as text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and context-related modeling. It uses information retrieval, lexical analysis, word frequency analysis, and pattern recognition techniques to find links and associations. It also draws on visualization, natural language processing algorithms, and analytical methods.
Advantages of Text Mining
- It structures unstructured data for further analysis.
- It reduces the risks of false insights.
- It supports informed decisions, automated processes, and research analysis.
Applications of Text Mining
- Information Extraction: Identification and extraction of relevant facts from the unstructured data.
- Document classification and clustering: It aims at grouping and categorizing documents and terms using classification and clustering methods.
- Information Retrieval: It aims at finding and retrieving the text relevant to a query from a collection of text documents.
- Natural Language Processing: It is a major part of text mining. Here we use different computational techniques for understanding and analyzing the unstructured data from a text file.
Implementing Text Mining in Python
We will now try a simple text mining problem: comparing word frequencies in two text files. Let us get to the code without delay. We are going to use Google Colab and move step by step until our requirements are complete.
Step 1: Importing modules
We need to import some modules in order to do our work.
- codecs module: It provides base classes for encoding and decoding data, and access to the codec registry that manages codec lookup and error handling.
- collections module: It provides specialized container types as alternatives to the built-in dict, list, tuple, and set. We use its Counter class here.
- NLTK (Natural Language Toolkit): It is a library used for natural language processing.
- Matplotlib library: It is used for visualizing our data as graphs or charts.
- NumPy library: It is used for working with arrays and provides many built-in methods for operating on them.
- Pandas library: It is used for data manipulation and analysis in the Python programming language.
- nltk.stem: It is a sub-package of NLTK that removes morphological affixes (such as suffixes) from words, leaving the word stem. We import the Porter stemmer algorithm from this sub-package for stemming.
- nltk.tokenize: It helps us split the text into tokens.
The last two packages work on the tokens of our text files. We will create a function, total_tokens(), that uses the tokenizer to split a text and count its tokens.
```python
import codecs
import collections

import numpy as np
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
import matplotlib
```
If any error occurs while importing these modules, you can install them from a notebook cell using pip as follows. In a regular command prompt, run the same commands without the leading exclamation mark.
```
!pip install nltk
!pip install pandas
!pip install numpy
!pip install matplotlib
```
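To get a feel for the two NLTK helpers imported above before using them on real files, the following sketch runs them on a made-up sentence and a few made-up words. The sample text and variable names are illustrative assumptions, not part of the tutorial's data.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer

# Hypothetical sample sentence, used only for illustration
sample = "Good muffins cost $3.88 in New York."

# WordPunctTokenizer splits text into runs of word characters
# and runs of punctuation characters
tokens = WordPunctTokenizer().tokenize(sample)
print(tokens)

# PorterStemmer strips common suffixes to reduce a word to its stem
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in ["running", "connected", "connections"]]
print(stems)
```

Note that the tokenizer treats punctuation as separate tokens, so a price such as "$3.88" is split into several pieces. This affects the token counts computed later.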
Step 2: Reading text file
The codecs module provides the codecs.open() method to read and write Unicode-encoded files. This is useful for reading files that include characters from many different languages.
```python
with codecs.open("/content/text1.txt", "r", encoding="utf-8") as f:
    text1 = f.read()

with codecs.open("/content/text2.txt", "r", encoding="utf-8") as f:
    text2 = f.read()
```
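If you do not have text files at hand, the round trip below may help to see what codecs.open() does with multilingual text. It is a minimal sketch: the temporary file path and the sample string are assumptions made for illustration and are unrelated to the tutorial's text1.txt and text2.txt.

```python
import codecs
import os
import tempfile

# Hypothetical file inside a temporary directory (not the tutorial's /content files)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write text containing characters from several scripts, encoded as UTF-8
original = "naïve café, 東京, Москва"
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(original)

# Read it back; codecs.open() decodes the stored bytes into a Python str
with codecs.open(path, "r", encoding="utf-8") as f:
    restored = f.read()

print(restored == original)
```

In Python 3, the built-in open() also accepts an encoding argument, so open(path, "r", encoding="utf-8") is a common alternative.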
Step 3: Creating Functions
We are using the WordPunctTokenizer().tokenize() method to split our text into tokens. The length of the resulting list gives the total number of tokens in the text file, which makes the data easier to work with.
We are using collections.Counter to store each individual token as a key in a dictionary-like object, with its count as the corresponding value. This way, the function below returns a Counter of token counts, as well as the total number of tokens.
```python
def total_tokens(text):
    # Split the text into tokens, then count each token and the total
    n = WordPunctTokenizer().tokenize(text)
    return collections.Counter(n), len(n)
```
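To see the shape of the value this kind of function returns, here is a standard-library sketch that can be run without NLTK. The name total_tokens_sketch and the sample sentence are hypothetical, and the regular expression is assumed to mirror the splitting rule of WordPunctTokenizer (runs of word characters, or runs of punctuation).

```python
import collections
import re

def total_tokens_sketch(text):
    # Assumption: this pattern stands in for WordPunctTokenizer's rule
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    # Return the per-token counts and the total number of tokens
    return collections.Counter(tokens), len(tokens)

# Hypothetical sample text
counter, size = total_tokens_sketch("the cat and the hat.")

print(counter["the"])          # count of one specific token
print(counter.most_common(2))  # list of (token, count) pairs, highest count first
print(size)                    # total number of tokens, punctuation included
```

The most_common() method returns a list of (token, count) pairs, which is the form the frequency table in the next part works with.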
We can create a function to calculate the relative and absolute frequency for the most common words in our text file. By doing this, we can create a dataframe that will return the results.
```python
# Function to compute absolute and relative frequencies for the most common words
def make_df(counter, size):
    # counter is a list of (token, count) pairs, e.g. from Counter.most_common()
    absolute_frequency = np.array([el[1] for el in counter])
    relative_frequency = absolute_frequency / size

    # Create a data frame from the absolute and relative frequencies
    df = pd.DataFrame(
        data=np.array([absolute_frequency, relative_frequency]).T,
        index=[el[0] for el in counter],
        columns=["Absolute frequency", "Relative frequency"],
    )
    df.index.name = "Most common words"
    return df
```
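The arithmetic behind such a frequency table can be checked on a tiny, hand-made example. The sketch below uses hypothetical token counts and builds the table from a dictionary of columns rather than a stacked NumPy array; that construction is an alternative chosen for illustration, not the tutorial's function.

```python
import collections

import pandas as pd

# Hypothetical token counts standing in for a real document
counter = collections.Counter({"the": 4, "cat": 2, "hat": 1, ".": 3})
size = sum(counter.values())  # total number of tokens

# most_common(n) returns (token, count) pairs, highest count first
top = counter.most_common(2)

# Absolute frequency is the raw count; relative frequency is count / total
df = pd.DataFrame(
    {
        "Absolute frequency": [count for _, count in top],
        "Relative frequency": [count / size for _, count in top],
    },
    index=[token for token, _ in top],
)
df.index.name = "Most common words"
print(df)
```

Building the frame from separate columns keeps the absolute counts as integers, whereas stacking both columns into one NumPy array converts them to floats.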
Step 4: Working on text
Now we are going to analyze each text individually using the two functions above. We will pass both texts into the total_tokens() function. After getting the individual tokens and their counts, we will pass them into the make_df() function to get our resulting data frame.
```python
# For text1
text1_counter, text1_size = total_tokens(text1)
make_df(text1_counter.most_common(10), text1_size)
```
The above code snippet uses the previous functions to create a data frame for each token and its frequencies. Let us do the same for text2.
```python
# For text2
text2_counter, text2_size = total_tokens(text2)
make_df(text2_counter.most_common(10), text2_size)
```
The above code snippet displays the corresponding data frame for the ten most common tokens in text2.
Let us find the most common words across the two documents and print the relative frequency difference for each of those words.
```python
all_counter = text1_counter + text2_counter
all_df = make_df(all_counter.most_common(1000), 1)
x = all_df.index.values

# Build the rows of the new data frame as df_data, comprising:
# text1 relative frequency as text1_c,
# text2 relative frequency as text2_c, and
# the relative frequency difference between both text files as difference
df_data = []
for word in x:
    # Relative frequency of each word in text1 and text2
    text1_c = text1_counter.get(word, 0) / text1_size
    text2_c = text2_counter.get(word, 0) / text2_size

    # Absolute value of the difference, so negative differences become positive
    difference = abs(text1_c - text2_c)

    # Append the three columns as one row
    df_data.append([text1_c, text2_c, difference])

# Create the data frame dist_df from the rows above
dist_df = pd.DataFrame(
    data=df_data,
    index=x,
    columns=["text1 relative frequency", "text2 relative frequency",
             "Relative frequency difference"],
)
dist_df.index.name = "Most common words"
dist_df.sort_values("Relative frequency difference", ascending=False, inplace=True)

# Display our required result
dist_df.head(10)
```
The above code snippet displays the ten words whose relative frequencies differ most between the two texts.
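As a separate, self-contained illustration of the comparison logic (not the output of the snippet above), the sketch below applies the same steps to two tiny hypothetical documents using only the standard library. All counts and names here are made up for the example.

```python
import collections

# Hypothetical token counts for two small documents
doc_a = collections.Counter({"data": 3, "text": 1})
doc_b = collections.Counter({"data": 1, "mining": 3})
size_a = sum(doc_a.values())  # 4 tokens
size_b = sum(doc_b.values())  # 4 tokens

# Adding two Counters sums the counts of each token
combined = doc_a + doc_b

rows = []
for word, _ in combined.most_common():
    # get(word, 0) returns 0 for a word missing from one document
    rel_a = doc_a.get(word, 0) / size_a
    rel_b = doc_b.get(word, 0) / size_b
    rows.append((word, rel_a, rel_b, abs(rel_a - rel_b)))

# Largest relative frequency difference first
rows.sort(key=lambda row: row[3], reverse=True)
for row in rows:
    print(row)
```

A word that appears often in one document and never in the other ends up at the top, which is what makes this difference a simple way to spot vocabulary that distinguishes the two texts.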
Step 5: Save the file as CSV
This step creates a file named output.csv and loads our output into it. This way, we save our required result.
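One possible way to perform this save step is sketched below. It assumes the result is a pandas data frame like dist_df and that it is written with DataFrame.to_csv; the small frame and the temporary directory are hypothetical stand-ins so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the dist_df data frame built in Step 4
dist_df = pd.DataFrame(
    {
        "text1 relative frequency": [0.0, 0.75],
        "text2 relative frequency": [0.75, 0.25],
        "Relative frequency difference": [0.75, 0.5],
    },
    index=["mining", "data"],
)
dist_df.index.name = "Most common words"

# Assumption: the result is saved with DataFrame.to_csv
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
dist_df.to_csv(out_path)

# Read the file back to confirm what was written;
# index_col=0 restores the words as the index
loaded = pd.read_csv(out_path, index_col=0)
print(loaded)
```

In Google Colab, passing just "output.csv" as the path would write the file to the current working directory of the notebook session.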
In this article, we covered a small part of text mining. We learned about some new modules and their methods, and then put them to use. We hope you enjoyed our tutorial today and look forward to learning more exciting topics in the future.