Hello readers, in this article we will look at Punkt, a sentence tokenizer module available in NLTK. NLTK (Natural Language Toolkit) is a Python library for building programs in the domain of Natural Language Processing. It contains modules for tasks such as text classification, parsing, stemming, and tokenizing.
What is PunktSentenceTokenizer?
In NLTK, Punkt is an unsupervised trainable model, which means it can be trained on unlabeled data, that is, data that has not been tagged with information identifying its characteristics, properties, or categories.
It divides a text into a list of sentences by using an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences. Before it can be used, it has to be trained on a sizable amount of plain text in the target language.
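As a rough illustration of what such a model holds, NLTK keeps the learned information in a `PunktParameters` object. The sketch below creates an empty one and prints the sets that training would fill; the abbreviation added by hand at the end is only an illustration, not something the article's code does.

```python
from nltk.tokenize.punkt import PunktParameters

# An untrained parameter object: every learned set starts out empty.
params = PunktParameters()

print(params.abbrev_types)   # abbreviations, stored lowercase without the final period
print(params.collocations)   # word pairs that commonly span a period
print(params.sent_starters)  # words that frequently begin a sentence

# Illustration only: training normally fills these sets.
params.abbrev_types.add("mr")
print(sorted(params.abbrev_types))
```

A tokenizer built from these parameters treats a period after a known abbreviation as weaker evidence for a sentence boundary.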
Where to use PunktSentenceTokenizer?
NLTK is one of the most commonly used modules in projects under the natural language processing domain. It offers an extensive range of functions, but to make the output more accurate and to make sure the model handles edge cases, we sometimes need to import a few extra modules.
For example, consider splitting a long text into sentences. The following is the input text, and the task is to separate it into individual sentences.
We met Miss. Tanaya Das and Mr.Rohan Singh today. They are pursuing a B.tech degree in Data Science.
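To see why this task is harder than it looks, here is a naive baseline that uses only Python's `re` module (it is not part of NLTK or Punkt). The helper name `naive_split` is made up for this sketch; it simply breaks the text after every sentence-ending punctuation mark that is followed by whitespace.

```python
import re

text = ("We met Miss. Tanaya Das and Mr.Rohan Singh today. "
        "They are pursuing a B.tech degree in Data Science.")


def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


for sentence in naive_split(text):
    print(sentence)
```

This rule cannot tell an abbreviation from the end of a sentence, so it yields three pieces and breaks after "Miss.". Resolving that ambiguity is what a trained sentence tokenizer is for.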
Before starting with the program, remember to import the nltk module and download the punkt package to avoid errors. Below is the code to do so.
import nltk
nltk.download('punkt')
['We met Miss.', 'Tanaya Das and Mr.Rohan Singh today.', 'They are pursuing a B.tech degree in Data Science.']
The output is close but not completely correct. Here the punkt package has succeeded in identifying the abbreviation “Mr.” but fails to recognize that the period after the abbreviation “Miss.” is not the end of the sentence.
The main advantage of this package, as discussed before, is that it uses an unsupervised algorithm, which means we can train the model on our own text and make the sentence splitting more accurate.
Training the punkt tokenizer on a corpus
Let’s try to train the punkt sentence tokenizer. For training, we first need to define a corpus. (A corpus is a collection of text or speech data used in natural language processing to train AI and machine learning systems.)
corpus = """ The word miss has multiple meanings thats the reason why its tricky for nlp to recognize it as a abbrevation.Miss. means to fail to hit something, to fail to meet something, or to feel sadness over the absence or loss of something. The word miss. has several other senses as a verb and a noun. To miss. something is to fail to hit or strike something, as with an arrow miss. a target. If a runaway vehicle miss. a stop sign, then it doesn’t smash into it. Real-life examples: If you throw a basketball to your friend and they don’t catch it, the ball miss. When a baseball player miss. a baseball with their bat, they try to hit the ball with the bat but fail to. A bowling ball that doesn’t knock down any pins has miss. them. """
Once we define a relevant corpus, we use PunktTrainer(), which learns the parameters used in Punkt sentence boundary detection. After that, we use its “train” function to gather learning information from a given text. If finalize is set to True, it determines all of the parameters for sentence boundary detection; otherwise, this is postponed until get_params() or finalize_training() is called. If verbose is True, the detected abbreviations are listed.
Syntax: PunktTrainer.train(text, verbose=False, finalize=True)
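The training code itself is not shown, so the following is a minimal sketch of how the steps described above can be put together. It assumes the corpus defined earlier (repeated here, wrapped across lines, so the block runs on its own) and builds a `PunktSentenceTokenizer` from the trained parameters. Whether a given abbreviation is learned depends on the corpus and the NLTK version.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# The article's training corpus, wrapped across lines for readability.
corpus = """
The word miss has multiple meanings thats the reason why its tricky for nlp to
recognize it as a abbrevation.Miss. means to fail to hit something, to fail to
meet something, or to feel sadness over the absence or loss of something. The
word miss. has several other senses as a verb and a noun. To miss. something is
to fail to hit or strike something, as with an arrow miss. a target. If a
runaway vehicle miss. a stop sign, then it doesn’t smash into it. Real-life
examples: If you throw a basketball to your friend and they don’t catch it, the
ball miss. When a baseball player miss. a baseball with their bat, they try to
hit the ball with the bat but fail to. A bowling ball that doesn’t knock down
any pins has miss. them.
"""

text = ("We met Miss. Tanaya Das and Mr.Rohan Singh today. "
        "They are pursuing a B.tech degree in Data Science.")

trainer = PunktTrainer()
# verbose=True prints what the trainer learns, such as accepted abbreviations;
# finalize=True (the default) computes the final parameters right away.
trainer.train(corpus, verbose=True, finalize=True)

# Build a tokenizer from the learned parameters instead of the pre-trained model.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

for sentence in tokenizer.tokenize(text):
    print(sentence)
```

If `miss` is learned as an abbreviation, the tokenizer no longer breaks the first sentence after “Miss.”.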
Abbreviation: [2.0326] miss
We met Miss. Tanaya Das and Mr.Rohan Singh today.
They are pursuing a B.tech degree in Data Science.
In this way, we have trained the model to recognize “Miss.” as an abbreviation and not to misinterpret the period after it as the end of a sentence. Similarly, we can define a corpus and train the unsupervised model to learn other abbreviations, acronyms, etc.
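When the training text comes from more than one source, the finalize flag of train() is useful: each text can be passed to the same trainer with finalize=False, and the remaining parameters are computed once at the end. The sketch below only demonstrates that call sequence. The two sample texts are made up and far too small for reliable training, so the learned sets may well be empty.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical sample texts; real training needs far more data than this.
first_text = ("Dr. Rao met Prof. Iyer on Monday. "
              "Dr. Rao and Prof. Iyer spoke to Dr. Sen later.")
second_text = ("Prof. Iyer wrote to Dr. Sen. "
               "Dr. Sen replied to Prof. Iyer the next day.")

trainer = PunktTrainer()

# finalize=False postpones the final step (sentence starters and
# collocations), so more text can be added first.
trainer.train(first_text, finalize=False)
trainer.train(second_text, finalize=False)

# Determine the remaining parameters from everything collected so far.
trainer.finalize_training(verbose=True)

params = trainer.get_params()
print(sorted(params.abbrev_types))

tokenizer = PunktSentenceTokenizer(params)
result = tokenizer.tokenize("Dr. Sen thanked Prof. Iyer. They met again in May.")
print(result)
```

Printing `params.abbrev_types` is a quick way to check what the trainer picked up before relying on the tokenizer.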
Natural Language Processing is a vast domain of artificial intelligence concerned with understanding the structure and meaning of human language. In Python, we use NLTK (Natural Language Toolkit) to implement it, and punkt is one of its modules. Punkt is designed to learn parameters, such as a list of abbreviations or acronyms, in an unsupervised way from a corpus related to the target domain.