Stemming and Lemmatization in Python

In the field of Natural Language Processing (NLP), stemming and lemmatization are text normalization techniques used to prepare text and documents for further analysis.

Understanding Stemming and Lemmatization

While working with language data, we need to acknowledge that words like ‘care’ and ‘caring’ share the same meaning but appear in different tenses and forms. Stemming and lemmatization let us reduce such words to their base form.

In this article, we will perform stemming and lemmatization using the NLTK and spaCy libraries.

What is Stemming?

A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer. (Wikipedia)

Stemming is used to preprocess text data. The English language has many variations of a single word, so to reduce the ambiguity a machine learning algorithm has to deal with, it is essential to filter out such variations and reduce words to their base form.

NLTK provides classes to perform stemming on words. The most widely used stemming algorithms are PorterStemmer and SnowballStemmer.

Creating a Stemmer with PorterStemmer

Let’s try out the PorterStemmer to stem words.

# Importing the required class
from nltk.stem.porter import PorterStemmer

# Creating the stemmer object
stemmer = PorterStemmer()

# Words to stem
words = ['rain', 'raining', 'faith', 'faithful', 'are', 'is', 'care', 'caring']

# Stemming the words
for word in words:
    print(word + ' -> ' + stemmer.stem(word))

Output:

rain -> rain
raining -> rain
faith -> faith
faithful -> faith
are -> are
is -> is
care -> care
caring -> care

The PorterStemmer class has a .stem() method that takes a word as its input argument and returns the word reduced to its root form.
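
In practice, stemming is applied token by token. As a minimal sketch (assuming NLTK’s punkt tokenizer data is available), you can combine the stemmer with nltk.word_tokenize to stem a whole sentence:

# A minimal sketch: stemming every token of a sentence
import nltk
nltk.download('punkt')  # tokenizer data needed by word_tokenize
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
sentence = "The dogs are running and caring for the kittens"
tokens = nltk.word_tokenize(sentence)
print(' '.join(stemmer.stem(token) for token in tokens))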

Creating a Stemmer with SnowballStemmer

The Snowball stemmer is also known as the Porter2 stemming algorithm, as it fixes a few shortcomings of the original Porter stemmer. Let’s see how to use it.

# Importing the class
from nltk.stem.snowball import SnowballStemmer

# Creating the stemmer object
snow_stemmer = SnowballStemmer(language='english')

# Words to stem
words = ['rain', 'raining', 'faith', 'faithful', 'are', 'is', 'care', 'caring']

# Stemming the words
for word in words:
    print(word + ' -> ' + snow_stemmer.stem(word))

Output:

rain -> rain
raining -> rain
faith -> faith
faithful -> faith
are -> are
is -> is
care -> care
caring -> care

The outputs from the two stemmers look identical here because we used a small set of words for the demonstration; the algorithms do diverge on other inputs. Feel free to experiment with different words and compare the outputs of the two, as in the sketch below.
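
As a small comparison sketch (the word choices are just illustrative), adverbs ending in ‘-ly’ are a classic case where the two algorithms disagree:

# Comparing PorterStemmer and SnowballStemmer on words where they tend to differ
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

for word in ['fairly', 'generously', 'sportingly']:
    print(word + ' -> ' + porter.stem(word) + ' | ' + snowball.stem(word))

Porter typically leaves non-words such as ‘fairli’, while Snowball returns ‘fair’.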

What is Lemmatization?

Lemmatization is the algorithmic process of finding the lemma of a word. Unlike stemming, which may chop a word down to a form that is not a real word, lemmatization reduces a word based on its meaning.

It returns the base or dictionary form of a word, which is known as the lemma.

At first, stemming and lemmatization may look the same, but they are actually quite different; we will see the difference between them in a later section.

Now let’s see how to perform lemmatization on text data.

Creating a Lemmatizer with Python spaCy

Note: Run the following command first to download the language model required for lemmatization:

python -m spacy download en_core_web_sm

# Importing the required module
import spacy

# Loading the English language model
nlp = spacy.load('en_core_web_sm')

# Running the pipeline (which includes lemmatization) on the text
doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")

# Printing each token alongside its lemma
for token in doc:
    print(token.text + ' --> ' + token.lemma_)

Output:

Apples --> apple
and --> and
oranges --> orange
are --> be
similar --> similar
. --> .
Boots --> boot
and --> and
hippos --> hippos
are --> be
n't --> not
. --> .

Calling nlp() returns a spaCy Doc object, which we can iterate over to get the individual tokens. The lemmatized form of each token is available through its .lemma_ attribute.

See how it automatically tokenizes the sentence for us.
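
Since .lemma_ is just a string on each token, rebuilding a normalized sentence is a one-liner. Here is a small sketch that reuses the doc from above and skips punctuation tokens:

# Joining token lemmas into a single normalized string,
# skipping punctuation tokens along the way
normalized = ' '.join(token.lemma_ for token in doc if not token.is_punct)
print(normalized)  # apple and orange be similar boot and hippos be not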

Creating a Lemmatizer with Python NLTK

NLTK’s lemmatizer uses WordNet. The NLTK lemmatization method is based on WordNet’s built-in morphy function.

Let’s see how to use it.

import nltk
nltk.download('wordnet')  # Lemmatization data
nltk.download('punkt')    # Tokenizer data needed by word_tokenize

# Importing the class
from nltk.stem import WordNetLemmatizer

# Creating the class object
lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "Apples and oranges are similar. Boots and hippos aren't."

# Tokenize the sentence
word_list = nltk.word_tokenize(sentence)
print(word_list)

# Lemmatize the list of words and join them back into a sentence
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)

Output:

['Apples', 'and', 'oranges', 'are', 'similar', '.', 'Boots', 'and', 'hippos', 'are', "n't", '.']
Apples and orange are similar . Boots and hippo are n't .
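
Notice that ‘are’ was not reduced to ‘be’ this time: WordNetLemmatizer treats every word as a noun unless told otherwise. Passing the part of speech through the pos argument fixes this; a small sketch reusing the lemmatizer from above:

# lemmatize() assumes pos='n' (noun) by default
print(lemmatizer.lemmatize('are'))              # are
print(lemmatizer.lemmatize('are', pos='v'))     # be

# The same applies to other verb forms
print(lemmatizer.lemmatize('caring', pos='v'))  # care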

Lemmatization vs. Stemming

It may be confusing at first to choose between stemming and lemmatization, but lemmatization is generally the more effective of the two.

We saw that both techniques reduce each word to a root. With stemming, the result may just be a truncated form of the target word, whereas lemmatization produces a true English word as the root, since it cross-references the target word against the WordNet corpus.

Stemming vs. lemmatization is a question of the trade-off between speed and detail. Stemming is usually faster than lemmatization, but it can be inaccurate. If we need our model to be as detailed and accurate as possible, lemmatization should be preferred, as the sketch below illustrates.
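
A quick side-by-side sketch (reusing the classes shown earlier) makes the difference concrete:

# Stemming may produce a non-word, while lemmatization returns a dictionary word
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'))          # studi (not a real word)
print(lemmatizer.lemmatize('studies'))  # study (a true dictionary word)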

Conclusion

In this article, we saw what stemming and lemmatization are all about and explored various ways to implement them.

We also compared lemmatization with stemming to unfold the differences between the two processes. Happy learning! 🙂