Word Cloud using Python

Welcome to this tutorial on creating a word cloud using Python. Word clouds have become a popular data visualization technique, especially where textual data is involved.

A word cloud is one of the prominent visualization techniques used alongside Natural Language Processing (NLP).

What is a Word Cloud?

A word cloud displays the most frequently used words in a text, such as an article, with each word sized according to the number of times it is used.

The greater the usage, the larger the word appears in the word cloud.
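
To make the idea concrete, here is a minimal sketch of the counting step that sits behind a word cloud, using only the standard library. The sample sentence and the `word_frequencies` helper are illustrative assumptions, not part of the wordcloud package, which does its own tokenizing and scaling.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word appears, ignoring case."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text, used only for this illustration.
sample = "Python is simple. Python is readable. Python is popular."
freqs = word_frequencies(sample)

# The most frequent words are the ones a word cloud would draw largest.
for word, count in freqs.most_common(3):
    print(word, count)
```

Note that common words such as "is" rank just as high as "Python" here, which is why stop words are removed later in the tutorial.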

How to Create a Word Cloud using Python?

So, let's begin creating our own word cloud using Python.

1. Install the wordcloud and wikipedia libraries

To create a word cloud, we need Python 3.x on our machine along with the wordcloud package. To install wordcloud, you can use the pip command:

sudo pip install wordcloud

For this example, I will be using a page from Wikipedia, namely Python (programming language). To use Wikipedia content, we need to install the wikipedia package.

sudo pip install wikipedia

2. Search Wikipedia based on a query

First, we will import the wikipedia library using the code snippet below:

import wikipedia

We will use the search function and take only the first element of its result, which is why we use [0]. This will be the title of our page.

def get_wiki(query):
	title = wikipedia.search(query)[0]

	# get wikipedia page for selected title
	page = wikipedia.page(title)
	return page.content

After extracting the title, we use page() to retrieve the page, and then return only its text using page.content.

If you call the above function and print the result, you will get the raw text of the page on the console. But our task does not end here; we still need to make a word cloud.

Image 1
Getting the raw data from wikipedia
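
One caveat with taking [0] directly: if the search returns no results, the index lookup raises an IndexError. The sketch below moves that choice into a small helper so the failure is explicit. `first_title` and `get_wiki_checked` are hypothetical names introduced here for illustration; only `wikipedia.search` and `wikipedia.page` come from the library, and the import is deferred so the helper can be tried without network access.

```python
def first_title(results):
    """Return the first search result, or raise a clear error if there is none."""
    if not results:
        raise ValueError("Wikipedia search returned no results")
    return results[0]


def get_wiki_checked(query):
    """Fetch the page content for the top search hit (needs network access)."""
    import wikipedia  # imported here so first_title works without the package

    title = first_title(wikipedia.search(query))
    return wikipedia.page(title).content


# The helper can be tried on a plain list, without touching the network:
print(first_title(["Python (programming language)", "Monty Python"]))
```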

3. Create cloud mask and set stop words

To begin with, we will import the WordCloud class and the STOPWORDS set from the wordcloud library.

We import STOPWORDS because we want to remove articles such as a, an, the and other common words used in the English language.

from wordcloud import WordCloud, STOPWORDS
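
To see what removing stop words does to the counts, here is a small standard-library sketch. The `TINY_STOPWORDS` set and the `content_words` helper are assumptions made up for this illustration; the real STOPWORDS set shipped with wordcloud is much larger, and WordCloud applies it for you when you pass the stopwords parameter.

```python
import re
from collections import Counter

# Illustrative stand-in for wordcloud's STOPWORDS (the real set is far larger).
TINY_STOPWORDS = {"a", "an", "the", "is", "of", "and"}


def content_words(text, stopwords=TINY_STOPWORDS):
    """Return word counts with stop words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords)


# Hypothetical sample text, used only for this illustration.
text = "The zen of Python is a collection of the guiding principles of Python"
counts = content_words(text)
print(counts.most_common(2))
```

Since STOPWORDS is an ordinary Python set, the same idea extends to the real library: after copying it with set(STOPWORDS), you can add your own words to the copy before passing it to WordCloud().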

We will also use a mask. This is a rough outline image named 'cloud.png', stored in the same directory as the script, which is denoted by currdir. We will open this image and store it in a numpy array.

Image 2
cloud.png
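
If you do not have a cloud.png at hand, a mask can also be built directly as a numpy array. The sketch below assumes the behaviour described in the wordcloud documentation, where pure white (255) entries are masked out and every other value is free to draw on. The 300x300 size and the circular shape are arbitrary choices for illustration.

```python
import numpy as np

size = 300
center = size // 2
radius = 130

# Open grids of row and column indices that broadcast to a size x size array.
rows, cols = np.ogrid[:size, :size]

# True for points outside the circle, False for points inside it.
outside = (rows - center) ** 2 + (cols - center) ** 2 > radius ** 2

# 255 (white) outside the circle is masked out; 0 inside is drawable.
mask = (255 * outside).astype(np.uint8)

print(mask.shape, mask[center, center], mask[0, 0])
```

An array like this can be passed as mask=mask to WordCloud() in place of the image loaded from disk.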

Our next task is to define a set of stop words, so we use set(STOPWORDS).

We create the word cloud object using WordCloud(), passing parameters such as background_color, max_words (here we choose 200 as our word limit), mask and stopwords.

We then call wc.generate() and pass the raw text as a parameter.

We can also save the generated word cloud to a file, which we will name output.png.

def create_wordcloud(text):
    # np, Image, path and currdir are defined in the complete listing
    # load the mask image as a numpy array
    mask = np.array(Image.open(path.join(currdir, "cloud.png")))

    stopwords = set(STOPWORDS)

    # create wordcloud object
    wc = WordCloud(background_color="white",
                   max_words=200,
                   mask=mask,
                   stopwords=stopwords)

    # build the cloud from the raw text
    wc.generate(text)

    # save wordcloud
    wc.to_file(path.join(currdir, "output.png"))

Running these two functions may take up to 30-40 seconds the first time, and less on later runs. The complete code and the output image are shown in the next section.

Complete Implementation of Word Cloud using Python

import sys
from os import path

import numpy as np
from PIL import Image
import wikipedia
from wordcloud import WordCloud, STOPWORDS

# directory containing this script; cloud.png is read from here
currdir = path.dirname(__file__)


def get_wiki(query):
    # take the first search result as the page title
    title = wikipedia.search(query)[0]
    page = wikipedia.page(title)
    return page.content


def create_wordcloud(text):
    # load the mask image as a numpy array
    mask = np.array(Image.open(path.join(currdir, "cloud.png")))

    stopwords = set(STOPWORDS)

    wc = WordCloud(background_color="white",
                   max_words=200,
                   mask=mask,
                   stopwords=stopwords)

    wc.generate(text)
    wc.to_file(path.join(currdir, "output.png"))


if __name__ == "__main__":
    # the search query is expected as the first command-line argument
    if len(sys.argv) < 2:
        sys.exit("usage: python <script> <search query>")
    query = sys.argv[1]
    text = get_wiki(query)

    create_wordcloud(text)

Output:

Image 3
output.png

Conclusion

Creating a word cloud using Python is one of the easiest ways to visualize the most frequently used words in any textual content. Just by running this code, it becomes easy to see the subject and topics discussed in the text.

I hope you enjoyed this article. Do let us know your feedback in the comment section below.