In this tutorial, we will learn how to remove stop words from a piece of text in Python. Removing stop words from text comes under pre-processing of data before using machine learning models on it.
What are stop words?
Stop words are words in a natural language that carry very little meaning on their own. These are words like ‘is’, ‘the’, and ‘and’.
While extracting information from text, these words don’t contribute anything meaningful. Therefore, it is good practice to remove stop words from text before using it to train machine learning models.
Another advantage of removing stop words is that it reduces the size of the dataset and the time taken in training of the model.
The practice of removing stop words is also common among search engines. Search engines like Google remove stop words from search queries to yield a quicker response.
In this tutorial, we will be using the NLTK module to remove stop words.
The NLTK module is one of the most popular Python libraries for natural language processing.
To start, we will download the corpus that contains the stop words.
Download the corpus with stop words from NLTK
To download the corpus, use:
import nltk
nltk.download('stopwords')
Now we can start using the corpus.
Print the list of stop words from the corpus
Let’s print out the list of stop words from the corpus. To do that, use:
from nltk.corpus import stopwords
print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
This is the list of stop words for the English language. Lists for other languages are available too.
To print the list of available languages, use:
from nltk.corpus import stopwords
print(stopwords.fileids())
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
These are the languages for which stop words are available in the NLTK ‘stopwords’ corpus.
How to add your own stop words to the corpus?
To add your own stop words to the list, use:
new_stopwords = stopwords.words('english')
new_stopwords.append('SampleWord')
Now you can use ‘new_stopwords’ as the new stop word list. Let’s learn how to remove stop words from a sentence using this list.
How to remove stop words from the text?
In this section, we will learn how to remove stop words from a piece of text. Before we move on, you should be familiar with tokenization.
Tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens form the building block of NLP.
We will use tokenization to convert a sentence into a list of words. Then we will remove the stop words from that Python list.
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "This is a sentence in English that contains the SampleWord"
text_tokens = word_tokenize(text)
remove_sw = [word for word in text_tokens if word not in stopwords.words('english')]
print(remove_sw)
['This', 'sentence', 'English', 'contains', 'SampleWord']
You can see that the output still contains ‘SampleWord’. That is because we used the default corpus for removing stop words. Let’s use the list that we created instead, again with a list comprehension.
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "This is a sentence in English that contains the SampleWord"
text_tokens = word_tokenize(text)
remove_sw = [word for word in text_tokens if word not in new_stopwords]
print(remove_sw)
['This', 'sentence', 'English', 'contains']
This tutorial covered removing stop words from text in Python using the NLTK module. We hope you had fun learning with us!