Tokenization in Python using NLTK


Let’s learn how to implement tokenization in Python using the NLTK library. As humans, we rely heavily on language to communicate with one another, and for Artificial Intelligence to be useful, computers need to understand our language too.

Making computers understand and process a language falls under Natural Language Processing (NLP). NLP is broadly defined as the automatic manipulation of natural language, such as speech and text, by software.

Tokenization is a common task performed under NLP. Tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens form the building blocks of NLP.
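Before reaching for a library, here is a rough sketch of the idea using nothing but plain string splitting. Note how the punctuation stays glued to the words; this is exactly the kind of thing NLTK’s smarter tokenizers handle for us later in this tutorial.

# A naive sketch: splitting on whitespace gives a crude form of word tokens.
sentence = "Hello there! Welcome to this tutorial on tokenizing."
print(sentence.split())
# ['Hello', 'there!', 'Welcome', 'to', 'this', 'tutorial', 'on', 'tokenizing.']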

Why do we need tokenization?

Deep learning architectures in NLP, such as RNNs and LSTMs, process text in the form of tokens.

By running tokenization on a corpus of text, we can form a vocabulary. The tokens are then represented in a form that is suitable for the corresponding language model.

This representation is referred to as a word embedding. Commonly used token representations range from simple one-hot encoding to learned embedding models such as Word2Vec’s Skip-gram.
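To make the vocabulary idea concrete, here is a minimal sketch using a plain Python set. This is only an illustration; real pipelines usually also lowercase, count token frequencies, and reserve special tokens.

# A minimal sketch of building a vocabulary from an already-tokenized corpus.
tokenized_corpus = [
    ["hello", "there", "!"],
    ["happy", "learning", "!"],
]
vocabulary = sorted({token for sentence in tokenized_corpus for token in sentence})
print(vocabulary)
# ['!', 'happy', 'hello', 'learning', 'there']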

In this tutorial we will learn how to tokenize our text.

Let’s write some Python code to tokenize a paragraph of text.

Implementing Tokenization in Python with NLTK

We will be using the NLTK module to tokenize our text. NLTK is short for Natural Language Toolkit. It is a Python library for symbolic and statistical Natural Language Processing.

NLTK makes it very easy to work on and process text data. Let’s start by installing NLTK.

1. Installing NLTK Library

Run the following pip command in your console to install NLTK:

pip install nltk

To download the data components of NLTK (corpora and models), use:

import nltk
nltk.download() 
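nltk.download() opens an interactive downloader. If you only need the tokenizers used in this tutorial, you can download just the Punkt tokenizer data instead; note that on newer NLTK releases the package is named 'punkt_tab' rather than 'punkt'.

import nltk
nltk.download('punkt')  # tokenizer data used by sent_tokenize and word_tokenize
# On newer NLTK releases you may need: nltk.download('punkt_tab')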

In this tutorial, we will be going over two types of tokenization:

  • Sentence tokenization
  • Word tokenization

2. Setting up Tokenization in Python

Let’s start by importing the necessary modules.

from nltk.tokenize import sent_tokenize, word_tokenize

sent_tokenize splits the text into sentences, while word_tokenize splits it into words.

The text we will be tokenizing is:

"Hello there! Welcome to this tutorial on tokenizing. After going through this tutorial you will be able to tokenize your text. Tokenizing is an important concept under NLP. Happy learning!"

Store the text in a variable.

text = "Hello there! Welcome to this tutorial on tokenizing. After going through this tutorial you will be able to tokenize your text. Tokenizing is an important concept under NLP. Happy learning!"

3. Sentence Tokenization in Python using sent_tokenize()

To tokenize the text into sentences, use:

print(sent_tokenize(text))

The output we get is:

['Hello there!', 'Welcome to this tutorial on tokenizing.', 'After going through this tutorial you will be able to tokenize your text.', 'Tokenizing is an important concept under NLP.', 'Happy learning!']

It returns a list in which each element is a sentence from the text.
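Since the result is an ordinary Python list, you can work with it like any other list. For example, continuing from the import and the text variable above, this small illustrative loop numbers each sentence:

# Illustrative usage: iterate over the tokenized sentences.
for i, sentence in enumerate(sent_tokenize(text), start=1):
    print(i, sentence)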

4. Word Tokenization in Python using word_tokenize()

To tokenize the text into words, use:

print(word_tokenize(text))

The output we get is:

['Hello', 'there', '!', 'Welcome', 'to', 'this', 'tutorial', 'on', 'tokenizing', '.', 'After', 'going', 'through', 'this', 'tutorial', 'you', 'will', 'be', 'able', 'to', 'tokenize', 'your', 'text', '.', 'Tokenizing', 'is', 'an', 'important', 'concept', 'under', 'NLP', '.', 'Happy', 'learning', '!']

It returns a list in which each element is a word or punctuation symbol from the text. These tokens can now be fed into a language model for training.
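As a rough sketch of that last step, here is one simple way to turn the word tokens into integer IDs, which is roughly the form a model’s input layer expects. This is only an illustration, continuing from the import and the text variable above; real pipelines also add special tokens, handle casing, and deal with out-of-vocabulary words.

# Sketch: map each unique token to an integer ID, then encode the text.
tokens = word_tokenize(text)
token_to_id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
encoded = [token_to_id[token] for token in tokens]
print(encoded)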

Complete Python code for tokenization using NLTK

The complete code is as follows:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello there! Welcome to this tutorial on tokenizing. After going through this tutorial you will be able to tokenize your text. Tokenizing is an important concept under NLP. Happy learning!"

print(sent_tokenize(text))
print(word_tokenize(text))

Conclusion

This tutorial covered tokenizing text in Python: why tokenization is needed and how to implement it using NLTK.

After you’ve tokenized text, you can also identify the sentiment of the text in Python. Have a look at this tutorial on sentiment analysis in Python.