3 Approaches to Removing the Duplicate Words From a Text

HOW TO REMOVE DUPLICATE WORDS FROM A TEXT

Suppose you are working with a large amount of text, in English or any other language, that was gathered from different sources and may contain many duplicate words in its sentences. You just need a summary of the text before doing anything else with it. What do you do?

Text summarization is a major application of Natural Language Processing (NLP), in which the gist of a text or document, sometimes an entire novel, is extracted, often with the help of libraries such as nltk.

Text preprocessing precedes many NLP tasks, including text summarization. It often involves several steps, such as removing stop words and filtering out duplicate words.

As a first step, refer to this article to learn how to remove stop words.

There are many built-in methods and libraries in Python we can leverage to remove duplicate words from a text or a string. This tutorial focuses on three different approaches to doing so.

Remove Duplicate Words Using Counter

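Counter is a dictionary subclass from the collections module, often used in competitive coding to count how many times each element occurs in a data structure. Each distinct element appears only once as a key, no matter how often it is repeated, and (in Python 3.7 and later) the keys keep the order in which they were first seen. We are going to leverage just that.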

from collections import Counter
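def remdup(text):  # "text" avoids shadowing the built-in input()
    words = text.split()  # split on any run of whitespace
    unqwords = Counter(words)  # keys are the unique words, in first-seen order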
    s = " ".join(unqwords.keys())
    print("The unique words are:")
    print(list(unqwords.keys()))
    print("Text with no duplicate words:")
    return s
inptext = "Python is a great language to dive into programming for absolute beginners. Java is also a great programming language which is also popular right now. C# is another high-level programming language supporting multiple paradigms"
newtext = remdup(inptext)
print(newtext)

We import the Counter class from the collections module in the very first line. The second line uses the def keyword to define a function that takes a string as its argument.

The string we supply as input is divided into individual words with the help of split. The MVP of this code appears in the fourth line, where Counter collects the words; its keys are the unique words, stored in a variable called unqwords. A separate string called s stores the text with no duplicates.

The text is stored in a variable called inptext and the new string with no duplicates is stored in newtext. The list of unique words and the new text can be seen in the output.
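
To make the role of Counter more concrete, here is a minimal sketch of what it holds for a short sentence. The sentence and the names sample, counts, and unique_in_order are made up for illustration, and the key ordering assumes Python 3.7 or later.

```python
from collections import Counter

# Hypothetical sample sentence, not the article's input text.
sample = "a great language is a great tool"

counts = Counter(sample.split())

# Each distinct word is a key; its value is how many times it appeared.
print(counts["great"])

# Iterating over a Counter yields its keys in first-seen order (Python 3.7+).
unique_in_order = list(counts)
print(" ".join(unique_in_order))
```

Only the keys matter for removing duplicates; the counts are a by-product that can be handy if you also want to know which words were repeated.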

Removing Duplicate Words using Counter

Remove Duplicate Words Using Set()

You might be wondering what the set data type has to do with duplicate words. If you don’t already know, a set is a built-in Python collection that does not allow duplicate elements. If we try to add an item that is already present, the set simply stays unchanged. Sets do not preserve order, though, so we pair the set with a list to keep the words in their original sequence.
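
As a quick illustration of that behavior, the following sketch uses a small, made-up word list (the names words and unique are hypothetical) to show duplicates collapsing inside a set.

```python
# Hypothetical word list for illustration only.
words = ["python", "is", "great", "python", "is", "popular"]

unique = set(words)       # duplicates collapse into one element each
unique.add("python")      # adding an element that already exists changes nothing

print(len(words), len(unique))
print("python" in unique)  # membership tests on a set are fast on average
```

Printing the set itself may show the words in a different order on each run, which is why a set alone is not enough when the original word order matters.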

The same process is followed here as in the previous example.

def remove_dup(text):
    words = text.split()
    unq_words = []
    seen_words = set()
    for word in words:
        if word not in seen_words:
            seen_words.add(word)
            unq_words.append(word)
    return ' '.join(unq_words)
text = "Python is a great language to dive into programming for absolute beginners. Java is also a great programming language which is also popular right now."
res = remove_dup(text)
print(res)

We initialize a list for the unique words and a set for the previously encountered words in the third and fourth lines. The logic is simple: a for loop iterates over each word of the input string and checks whether it has already been visited. If it has not, it is unique, right? Such unique words are recorded in the set of visited words and appended (added at the end) to the unique word list. By doing so, we make sure that if the same word is encountered again, it is left out of the result.
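
One consequence of this logic is that words are compared as exact strings, so "Great", "great", and "great." count as three different words. If that is not what you want, one possible variation is sketched below. It assumes that case and surrounding punctuation should be ignored when comparing, while the original spelling of the first occurrence is kept. The function name remove_dup_normalized and the sample sentence are hypothetical.

```python
import string

def remove_dup_normalized(text):
    """Drop repeated words, ignoring case and surrounding punctuation."""
    seen = set()
    kept = []
    for word in text.split():
        # The key is used only for comparison; the original word is what we keep.
        key = word.strip(string.punctuation).lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)

print(remove_dup_normalized("Great tools make great code. Code matters."))
```

Whether this normalization is appropriate depends on the task; for some texts, a capitalized word and its lowercase form carry different meanings and should stay separate.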

Removing Duplicate Words using Set

Remove Duplicate Words Using the NLTK Library

Let’s try something new this time. Let us do the same thing with the help of the nltk library. We can consider this approach as an extension to the previous one, as we are going to use sets here, but the difference is how we treat the input string using nltk.

Before we do this, we need to first make sure to install the nltk library.

pip install nltk

import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize
def rem_dup(text):
    words = word_tokenize(text)
    unq = set()
    res = []
    for word in words:
        if word not in unq:
            unq.add(word)
            res.append(word)
    return ' '.join(res)

inptext = "Python is a great language to dive into programming for absolute beginners. Java is also a great programming language which is also popular right now."
newtext = rem_dup(inptext)
print(newtext)

Punkt is NLTK's pre-trained sentence tokenizer model. The word tokenizer we use in this code relies on it to split the text into sentences before breaking each sentence into individual words and punctuation tokens.

Learn more about punkt here in detail

The resulting tokens then go through the same process we observed in the second approach.
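
The practical effect of tokenizing instead of splitting on whitespace is that punctuation becomes its own token. The sketch below illustrates the idea with a simple regular expression as a stand-in. It is not NLTK's tokenizer and will differ from word_tokenize on many inputs (contractions, abbreviations, and so on). The helper name simple_tokens and the sample sentence are hypothetical.

```python
import re

def simple_tokens(text):
    # Rough stand-in for a word tokenizer: runs of word characters,
    # or any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Python is great. Java is great."

print(sample.split())          # punctuation stays attached to the word
print(simple_tokens(sample))   # punctuation is split off as a separate token
```

With punctuation split off, a repeated full stop is treated like any other duplicate token and dropped, and joining the tokens with spaces leaves a space in front of the punctuation that remains. Keep this in mind when reading the output of the nltk version.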

Using nltk to remove duplicate words

Conclusion

I hope this article has been helpful in demonstrating a few of the many ways to remove duplicate words from a piece of text. To briefly recap, we have seen examples of removing duplicates from a string using the Counter class from the collections module, leveraging the set data type, and using the nltk library.

Please feel free to test these methods out with your own text pieces!

References

Find more about the collections module here

Sets documentation

NLTK PUNKT