Text Mining in Python – A Complete Guide
Text mining is the process of turning raw, unstructured text into structured data you can count, analyze, and visualize. In Python, a handful of libraries handle the full pipeline: loading text, cleaning it, tokenizing words, removing common filler words, and producing frequency charts that reveal what a document is actually about.
This guide walks through each step of the pipeline from scratch. No machine learning required — just Python, NLTK, Pandas, and Matplotlib.
What is Text Mining?
Text mining, also called text analytics, extracts meaningful information from natural language by counting word frequencies after cleaning and normalizing the text. The core technique is frequency analysis — which words appear most often, which words are unique to a document, and how the vocabulary changes between documents.
The typical pipeline has five stages:
- Acquire — load text from a file, URL, or database
- Clean — remove punctuation, digits, and extra whitespace
- Tokenize — split text into individual words
- Filter — remove stopwords, apply stemming
- Analyze — count frequencies, compare documents, visualize
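As a preview, the five stages can be sketched with only the standard library; the toy text and stopword list below are made up, and NLTK replaces the crude pieces in the steps that follow:

```python
import re
from collections import Counter

text = "Text mining turns raw text into data. Text mining counts words."
stop = {"a", "the", "into", "and"}                 # toy stand-in for NLTK's stopword list

cleaned = re.sub(r'[^a-z\s]', '', text.lower())    # clean
tokens = cleaned.split()                           # tokenize (crude)
filtered = [w for w in tokens if w not in stop]    # filter
freq = Counter(filtered)                           # analyze

print(freq.most_common(2))  # → [('text', 3), ('mining', 2)]
```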
Installing the Libraries
You need three libraries for this entire guide. NLTK handles tokenization, stopwords, and stemming. Pandas manages the frequency data in DataFrames. Matplotlib produces the charts.
pip install nltk pandas matplotlib
After installing NLTK, download the datasets it needs:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables required by word_tokenize in NLTK 3.9+
nltk.download('stopwords')
Step 1 — Load and Clean Text
Raw text contains punctuation, digits, and capitalization that inflate word counts with noise. The clean step normalizes everything to lowercase and strips anything that is not a letter or space.
import re

def load_text(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.read()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)      # keep only letters and spaces
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text
raw = load_text('article.txt')
cleaned = clean_text(raw)
print(f"Original: {len(raw)} chars | Cleaned: {len(cleaned)} chars")
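To sanity-check the cleaning, run it on a short made-up string; the function body is repeated here so the snippet runs on its own:

```python
import re

def clean_text(text):
    # Same cleaning as above: lowercase, drop non-letters, collapse whitespace
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("Hello, World! It's 2024."))  # → hello world its
```

Note that the apostrophe in "It's" is stripped, so contractions collapse into forms like "its"; for frequency counting this is usually acceptable, since most collapsed forms are stopwords anyway.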
Step 2 — Tokenize and Remove Stopwords
Tokenization splits the cleaned text into individual words. Stopwords are common words like "the", "is", and "and" that appear so frequently they overwhelm the actual signal. Removing them before counting gives you meaningful frequency data.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
def tokenize_and_filter(text):
    tokens = word_tokenize(text)
    return [w for w in tokens if w not in stop_words]
words = tokenize_and_filter(cleaned)
print(f"Total tokens after filtering: {len(words)}")
print(f"Sample: {words[:15]}")
Step 3 — Apply Stemming
Stemming reduces words to their root form by lopping off suffixes: "running" becomes "run", "documents" becomes "document". This groups variations together so your frequency counts are not split across inflected forms.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(f"Stemmed sample: {stemmed[:15]}")
The Porter Stemmer is the standard default: fast and deterministic. The Snowball Stemmer is a refinement of Porter that handles more edge cases and supports several other languages; NLTK also ships the more aggressive Lancaster Stemmer.
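The mechanics of suffix stripping can be seen in a deliberately crude toy stemmer. This is only an illustration of the idea, not the Porter algorithm, which applies ordered rule sets and handles cases like consonant doubling:

```python
def toy_stem(word):
    # Strip the first matching suffix, but only if a reasonable stem remains
    for suffix in ('ing', 'ly', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ['walking', 'documents', 'cats', 'quickly']])
# → ['walk', 'document', 'cat', 'quick']
```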
Step 4 — Build a Frequency Distribution
NLTK’s FreqDist turns a list of tokens into a dictionary of word counts sorted by frequency. This is the core output of any text mining pipeline.
from nltk import FreqDist
freq = FreqDist(stemmed)
print(f"Unique words: {len(freq)}")
print(f"Top 10: {freq.most_common(10)}")
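FreqDist behaves much like the standard library's collections.Counter, which is a reasonable fallback when NLTK is unavailable:

```python
from collections import Counter

tokens = ['text', 'mine', 'text', 'word', 'text', 'word']
freq = Counter(tokens)

print(len(freq))            # unique words → 3
print(freq.most_common(2))  # → [('text', 3), ('word', 2)]
```

In fact, FreqDist subclasses Counter, so most_common, items, and indexing work the same way on both.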
Step 5 — Save Results to CSV
Export the frequency data to a CSV file using Pandas. Each row is a word with its count and relative frequency as a percentage of total tokens.
import pandas as pd
total = sum(freq.values())
df = pd.DataFrame(freq.most_common(), columns=['word', 'count'])
df['relative_freq'] = (df['count'] / total * 100).round(2)
df.to_csv('word_frequencies.csv', index=False)
print(df.head(10))
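If Pandas is not available, the standard library's csv module can write the same file; the column names below match the Pandas version, and the small Counter stands in for the real frequency distribution:

```python
import csv
from collections import Counter

freq = Counter(['text', 'mine', 'text', 'word'])   # stand-in for the FreqDist
total = sum(freq.values())

with open('word_frequencies.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'count', 'relative_freq'])
    for word, count in freq.most_common():
        writer.writerow([word, count, round(count / total * 100, 2)])
```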
Step 6 — Visualize Frequency Distributions
Matplotlib plots make frequency distributions easy to interpret. A bar chart of the top 20 words gives an instant picture of what a document is about.
import matplotlib.pyplot as plt
top_20 = freq.most_common(20)
words_list, counts = zip(*top_20)
plt.figure(figsize=(12, 6))
plt.bar(words_list, counts, color='steelblue')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Top 20 Words in Document')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
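plt.show() needs a display; on a headless server or in a scheduled job, switching to the Agg backend and saving to a file works instead (the filename and the counts here are arbitrary):

```python
import matplotlib
matplotlib.use('Agg')            # non-interactive backend, no display required
import matplotlib.pyplot as plt

# Made-up counts standing in for freq.most_common(20)
words_list, counts = ['text', 'mine', 'word'], [30, 12, 9]

plt.figure(figsize=(12, 6))
plt.bar(words_list, counts, color='steelblue')
plt.title('Top Words')
plt.tight_layout()
plt.savefig('top_words.png')     # write to disk instead of plt.show()
```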
Comparing Two Documents
To compare two documents, run the full pipeline on each and compute relative frequencies. Words with the largest frequency difference between documents are the most distinctive to each.
def mine_text(filepath):
    raw = load_text(filepath)
    cleaned = clean_text(raw)
    words = tokenize_and_filter(cleaned)
    stemmed = [stemmer.stem(w) for w in words]
    freq = FreqDist(stemmed)
    total = sum(freq.values())
    # Relative frequency as percentage
    return {w: (c / total * 100) for w, c in freq.items()}
freq_a = mine_text('doc_a.txt')
freq_b = mine_text('doc_b.txt')
all_words = set(freq_a.keys()) | set(freq_b.keys())
diff = {w: abs(freq_a.get(w, 0) - freq_b.get(w, 0)) for w in all_words}
distinctive = sorted(diff.items(), key=lambda x: x[1], reverse=True)[:10]
print("Most distinctive words:", distinctive)
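The comparison logic is easiest to see on small made-up relative-frequency dictionaries:

```python
freq_a = {'cat': 3.0, 'dog': 1.0, 'fish': 0.5}   # percentages from doc A
freq_b = {'cat': 1.0, 'bird': 2.5}               # percentages from doc B

all_words = set(freq_a) | set(freq_b)
diff = {w: abs(freq_a.get(w, 0) - freq_b.get(w, 0)) for w in all_words}
distinctive = sorted(diff.items(), key=lambda x: x[1], reverse=True)

print(distinctive)  # → [('bird', 2.5), ('cat', 2.0), ('dog', 1.0), ('fish', 0.5)]
```

A word missing from one document defaults to 0, so vocabulary unique to either document naturally scores as highly distinctive.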
Complete Pipeline in One Script
Here is the full text mining pipeline assembled into a single working script. Copy it into a .py file, drop your text file alongside it, and run it.
import re
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import FreqDist
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by word_tokenize in NLTK 3.9+
nltk.download('stopwords', quiet=True)
def load_text(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.read()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def tokenize_and_filter(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    return [w for w in tokens if w not in stop_words]

def stem_words(words):
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in words]

def plot_top_words(freq, n=20):
    top = freq.most_common(n)
    words, counts = zip(*top)
    plt.figure(figsize=(12, 6))
    plt.bar(words, counts, color='steelblue')
    plt.xlabel('Word')
    plt.ylabel('Frequency')
    plt.title(f'Top {n} Words')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
# Run pipeline
filepath = 'article.txt'
raw = load_text(filepath)
cleaned = clean_text(raw)
tokens = tokenize_and_filter(cleaned)
stemmed = stem_words(tokens)
freq = FreqDist(stemmed)
# Save to CSV
df = pd.DataFrame(freq.most_common(), columns=['word', 'count'])
df['relative_freq'] = (df['count'] / sum(freq.values()) * 100).round(2)
df.to_csv('word_frequencies.csv', index=False)
# Visualize
plot_top_words(freq)
print(f"Unique words: {len(freq)}")
print(f"Top 5: {freq.most_common(5)}")
Summary
The text mining pipeline in Python follows five steps: load, clean, tokenize, filter, and count. NLTK provides the linguistic tools — tokenization, stopword lists, and stemming — while Pandas and Matplotlib handle the data analysis and visualization.
- Load raw text and strip punctuation, digits, and extra whitespace
- Tokenize with NLTK’s word_tokenize and remove English stopwords
- Apply Porter Stemmer to reduce words to their root forms
- Use FreqDist for frequency counts and Pandas to export CSV data
- Matplotlib visualizes word distributions as bar charts
- The full pipeline — load, clean, tokenize, filter, count, visualize — fits in under 80 lines of Python
Frequently Asked Questions
What is the difference between tokenization and stemming?
Tokenization splits text into individual units called tokens, typically words. Stemming reduces those tokens to their root form by removing suffixes. Tokenization always comes first, and both are standard steps in any text mining pipeline.
Which stemmer should I use?
The Porter Stemmer is fast and deterministic, making it a good default. The Snowball Stemmer is a refined version that handles more edge cases and supports multiple languages. For most English text mining tasks, Porter is sufficient.
How do I handle non-English text?
NLTK provides stopword lists for many languages. Pass the language name to stopwords.words('spanish') or stopwords.words('french') instead of 'english'. You may also need a language-specific stemmer.
Why should I use relative frequency instead of absolute counts?
Absolute counts depend on document length. A 5,000-word document will naturally have higher counts than a 500-word document even if the relative proportions are the same. Relative frequency (percentage of total tokens) lets you compare documents of different lengths on equal footing.
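A worked example with made-up numbers:

```python
# 'python' appears 10 times in a 500-word document and 50 times in a 5,000-word one
rel_a = 10 / 500 * 100    # 2.0
rel_b = 50 / 5000 * 100   # 1.0

# The shorter document uses the word more densely despite the lower raw count
print(rel_a > rel_b)  # → True
```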
Can text mining work on a single short document?
Yes. For a focused document on a specific topic, even a few hundred words can reveal the main themes through frequency analysis. For statistical reliability in classification or clustering tasks across many documents, hundreds to thousands of documents are more typical.