Creating a TF-IDF Model from Scratch in Python


When you read the sentence “Hello there, how have you been?”, you can easily understand what is being asked, but computers are good with numbers, not with words.

In order for a computer to make sense of sentences and words, we represent them using numbers, hoping to preserve their context and meaning.

The TF-IDF model is one such method of representing words as numerical values. TF-IDF stands for “Term Frequency – Inverse Document Frequency”.

This method addresses a drawback of the bag of words model: instead of assigning equal value to all words, it gives high weights to important words that occur only a few times.

In this article, we will build a TF-IDF representation of a sample text corpus, step by step, from scratch.

An Introduction to TF-IDF

TF-IDF is the product of Term Frequency and Inverse Document Frequency. Here’s the formula for TF-IDF calculation.

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

What are Term Frequency and Inverse Document Frequency, you ask? Let’s see what they actually are.

What is Term Frequency?

Term frequency is a measure of how frequently a word occurs in a document. It is the ratio of the number of times the word appears in the document to the total number of words in that document.

tf(t,d) = count of t in d / number of words in d
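
As a quick illustration, the formula is easy to verify in plain Python (the sentence below is made up for the example):

#A quick worked example of the TF formula (illustrative sentence)
doc = "the cat sat on the mat".split()
tf = doc.count("the") / len(doc)
print(tf)   #2 occurrences / 6 words = 0.333...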

What is Inverse Document Frequency?

Inverse document frequency measures how rare a word is across the corpus: words that occur rarely have a high IDF score. It is the log of the ratio of the total number of documents to the number of documents containing the word.

We take the log of this ratio because, as the corpus grows large, raw IDF values can explode; taking the log dampens this effect.

Since we cannot divide by zero, we smooth the value by adding 1 to the denominator.

idf(t) = log(N/(df + 1))
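
As a quick illustration with hypothetical counts: in a corpus of N = 10 documents, a word that appears in 4 of them gets idf = log(10 / (4 + 1)) = log(2) ≈ 0.693.

#A worked example of the IDF formula (hypothetical counts)
import numpy as np

N = 10    #total number of documents in the corpus
df = 4    #number of documents containing the word
print(np.log(N / (df + 1)))   #log(2) ≈ 0.693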

Step by Step Implementation of the TF-IDF Model

Let’s get right to the implementation part of the TF-IDF Model in Python.

1. Preprocess the data

We’ll start by preprocessing the text data: building a vocabulary set of the words in our training data and assigning a unique index to each word in the set.

#Importing the required modules
import numpy as np
from nltk.tokenize import word_tokenize

#If the punkt tokenizer data is missing, run: import nltk; nltk.download('punkt')

#Example text corpus for our tutorial
text = ['Topic sentences are similar to mini thesis statements.\
        Like a thesis statement, a topic sentence has a specific \
        main point. Whereas the thesis is the main point of the essay',\
        'the topic sentence is the main point of the paragraph.\
        Like the thesis statement, a topic sentence has a unifying function. \
        But a thesis statement or topic sentence alone doesn’t guarantee unity.', \
        'An essay is unified if all the paragraphs relate to the thesis,\
        whereas a paragraph is unified if all the sentences relate to the topic sentence.']

#Preprocessing the text data
sentences = []
word_set = []

for sent in text:
    #Tokenize, lowercase, and keep only alphabetic tokens
    x = [token.lower() for token in word_tokenize(sent) if token.isalpha()]
    sentences.append(x)
    for word in x:
        if word not in word_set:
            word_set.append(word)

#Set of vocab 
word_set = set(word_set)
#Total documents in our corpus
total_documents = len(sentences)

#Creating an index for each word in our vocab.
index_dict = {}   #Dictionary to store the index of each word
for i, word in enumerate(word_set):
    index_dict[word] = i

2. Create a dictionary for keeping count

We then create a dictionary that keeps count of the number of documents containing each word, i.e. the document frequency from the IDF formula above.

#Create a count dictionary of document frequencies

def count_dict(sentences):
    word_count = {}
    for word in word_set:
        word_count[word] = 0
        for sent in sentences:
            #Count each word at most once per document
            if word in sent:
                word_count[word] += 1
    return word_count

word_count = count_dict(sentences)
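
As a quick sanity check, the word ‘thesis’ appears in all three of our sample documents, so its count should be 3:

#'thesis' occurs in every document of our sample corpus
print(word_count['thesis'])   #3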

3. Define a function to calculate Term Frequency

Now, let’s define a function to compute the term frequency (TF) first.

#Term Frequency
def termfreq(document, word):
    #Fraction of the document's tokens that match the given word
    N = len(document)
    occurrence = len([token for token in document if token == word])
    return occurrence / N
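
For example, we can check the frequency of a word in the first preprocessed document (the word chosen here is just for illustration):

#Example usage: frequency of 'topic' in the first document
print(termfreq(sentences[0], 'topic'))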

4. Define a function to calculate Inverse Document Frequency

Now, with the term frequency function set, let’s define another function for the Inverse Document Frequency (IDF).

#Inverse Document Frequency

def inverse_doc_freq(word):
    #Add 1 to the document count to avoid division by zero
    word_occurrence = word_count.get(word, 0) + 1
    return np.log(total_documents / word_occurrence)
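
For example, ‘thesis’ occurs in every one of our three documents, so it receives the lowest possible score; note that the +1 smoothing even makes it slightly negative:

#'thesis' appears in all 3 documents: log(3 / (3 + 1)) ≈ -0.288
print(inverse_doc_freq('thesis'))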

5. Combining the TF-IDF functions

Let’s create another function to combine both the TF and IDF functions from above to give us our desired output for the TF-IDF model.

def tf_idf(sentence):
    #One vector entry per word in the vocabulary
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)

        value = tf * idf
        tf_idf_vec[index_dict[word]] = value
    return tf_idf_vec

6. Apply the TF-IDF Model to our text

The implementation of the TF-IDF model in Python is complete. Now, let’s pass the text corpus to the function and see what the output vector looks like.

#TF-IDF Encoded text corpus
vectors = []
for sent in sentences:
    vec = tf_idf(sent)
    vectors.append(vec)

print(vectors[0])
Output: the TF-IDF encoded vector for the first document.
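
Each document is now represented by a vector with one entry per vocabulary word, so stacking the vectors gives a matrix with one row per document:

#One row per document, one column per vocabulary word
print(np.array(vectors).shape)   #(3, len(word_set))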

Now, if the model encounters an unknown word outside the vocabulary, it will raise a KeyError, because we did not account for unknown tokens.
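
A simple workaround, sketched below as an illustration (the helper tf_idf_safe is not part of the model above), is to skip out-of-vocabulary tokens when building the vector:

#A minimal sketch of an OOV-safe variant: unknown words are skipped
def tf_idf_safe(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        if word not in index_dict:   #ignore out-of-vocabulary tokens
            continue
        tf_idf_vec[index_dict[word]] = termfreq(sentence, word) * inverse_doc_freq(word)
    return tf_idf_vec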

The purpose of this article is to demonstrate how TF-IDF actually works under the hood.

You can find the notebook for this tutorial on my GitHub repository here.

Feel free to implement and modify the code using a new and more versatile text corpus.

Conclusion

In this article, we implemented a TF-IDF model from scratch in Python. We also focused on understanding some theory behind the model and finally encoded our own sentences using functions we created.

Happy Learning!