POS Tagging in NLP using spaCy

Parts of speech (POS) are the categories into which words are grouped according to the role they play in a sentence. English is traditionally described as having 8 parts of speech. They are:

  1. Nouns
  2. Pronouns
  3. Verbs
  4. Adverbs
  5. Adjectives
  6. Prepositions
  7. Conjunctions
  8. Interjections

A POS tag provides a considerable amount of information about a word and its neighbours. It can be used in various tasks such as sentiment analysis and text-to-speech conversion.


What is POS tagging?

Tagging means classifying tokens into predefined classes. Part-of-speech (POS) tagging is the process of marking each word in a given corpus with a suitable tag, i.e. its part of speech, based on the context in which the word appears. It is also known as grammatical tagging.


Techniques for POS tagging

There are mainly four types of POS taggers:

  1. Rule-based taggers: These taggers use pre-defined rules, together with the context in which a word appears, to assign a part of speech to the word.
  2. Stochastic/Probabilistic taggers: These taggers rely on probability, frequency and statistics. The simplest variant finds the tag that was most frequently assigned to a given word in the training data and assigns that tag to the word in the test data. Because it ignores context, this may sometimes produce tagging that is grammatically incorrect.
  3. Memory-based taggers: A collection of cases is kept in memory, each consisting of a word, its context, and an appropriate tag. A new sentence is tagged based on the best match among the stored cases.
  4. Transformation-based taggers: These combine rule-based and stochastic tagging. Rules are generated automatically from the data, and some pre-defined rules are used as well. Both are applied together to perform POS tagging.
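
To make the first two approaches more concrete, here is a minimal sketch of a most-frequent-tag (stochastic) tagger that falls back to a few hand-written rules for words it has never seen. The training data, the function names and the suffix rules are all made up for illustration; real taggers are trained on far larger corpora and use much richer rules.

```python
from collections import Counter, defaultdict

# Tiny hand-labelled training data (hypothetical, for illustration only).
TRAIN = [
    [("Sarah", "PROPN"), ("lives", "VERB"), ("in", "ADP"), ("a", "DET"), ("hut", "NOUN")],
    [("She", "PRON"), ("has", "VERB"), ("an", "DET"), ("apple", "NOUN"), ("tree", "NOUN")],
    [("The", "DET"), ("apples", "NOUN"), ("are", "AUX"), ("red", "ADJ")],
]

def train_most_frequent_tagger(tagged_sentences):
    """Count how often each tag was seen for each lower-cased word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_name in sentence:
            counts[word.lower()][tag_name] += 1
    # Keep only the most frequent tag for every word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def rule_based_guess(word):
    """Hand-written fallback rules for words not seen in training."""
    if word[0].isupper():
        return "PROPN"
    if word.endswith("ly"):
        return "ADV"
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    return "NOUN"

def tag(words, model):
    """Use the learned tag when available, otherwise fall back to the rules."""
    return [(w, model.get(w.lower()) or rule_based_guess(w)) for w in words]

model = train_most_frequent_tagger(TRAIN)
print(tag(["The", "tree", "has", "red", "apples"], model))
print(tag(["Maria", "quickly", "painted", "a", "fence"], model))
```

Note that this sketch never looks at neighbouring words, which is exactly why such a simple tagger can produce grammatically incorrect tag sequences.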

Now, let’s implement POS tagging in Python. We’ll see how to perform POS tagging using the spaCy library.

Consider the text below as our corpus for POS tagging.

Sarah lives in a hut in the village.
She has an apple tree in her backyard.
The apples are red in colour.

Let’s create a dataframe of the above sentences.

import pandas as pd 

text = ['Sarah lives in a hut in the village.', 
      'She has an apple tree in her backyard.', 
      'The apples are red in colour.']

df = pd.DataFrame(text, columns=['Sentence'])

df

Figure: POS tagging dataframe

POS tagging using spaCy

Let’s get started with POS tagging using spaCy in Python.

import spacy

#load the small English model
nlp = spacy.load("en_core_web_sm")

#list to store the tokens and pos tags 
token = []
pos = []

#process every sentence; keeping one entry per row keeps the lists aligned with df
for sent in nlp.pipe(df['Sentence']):
    #add the tokens present in the sentence to the token list
    token.append([word.text for word in sent])
    #add the pos tag for each token to the pos list
    pos.append([word.pos_ for word in sent])

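To know more about the spaCy models, refer to the spaCy models documentation. If the en_core_web_sm model is not installed yet, it can be downloaded with the command python -m spacy download en_core_web_sm.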

In the above code,

  • We first import the spacy library and load the small English model.
  • We create two lists: one (token) for storing the tokens and another (pos) for storing the part-of-speech tags.
  • Then, we loop over each sentence in the ‘Sentence’ column of the dataframe (df) and
    • Add the tokens from each sentence to the token list.
    • Add each token’s corresponding part-of-speech tag to the pos list.

The token list looks like this:

[['Sarah', 'lives', 'in', 'a', 'hut', 'in', 'the', 'village', '.'],
 ['She', 'has', 'an', 'apple', 'tree', 'in', 'her', 'backyard', '.'],
 ['The', 'apples', 'are', 'red', 'in', 'colour', '.']]

And the pos list is:

[['PROPN', 'VERB', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT'],
 ['PRON', 'VERB', 'DET', 'NOUN', 'NOUN', 'ADP', 'PRON', 'NOUN', 'PUNCT'],
 ['DET', 'NOUN', 'AUX', 'ADJ', 'ADP', 'NOUN', 'PUNCT']]
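
The tags in this output are Universal POS tags, so some of them (such as PROPN, ADP, DET and AUX) do not match the eight traditional names listed at the start. As a small sketch, the two lists shown above can be paired up word by word and printed with a short description of each tag. The lists are copied from the output above, and the TAG_MEANING dictionary and helper function are written by hand for this illustration; spaCy also provides spacy.explain for looking up tag descriptions.

```python
# Token and tag lists copied from the output shown above.
token = [['Sarah', 'lives', 'in', 'a', 'hut', 'in', 'the', 'village', '.'],
         ['She', 'has', 'an', 'apple', 'tree', 'in', 'her', 'backyard', '.'],
         ['The', 'apples', 'are', 'red', 'in', 'colour', '.']]
pos = [['PROPN', 'VERB', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT'],
       ['PRON', 'VERB', 'DET', 'NOUN', 'NOUN', 'ADP', 'PRON', 'NOUN', 'PUNCT'],
       ['DET', 'NOUN', 'AUX', 'ADJ', 'ADP', 'NOUN', 'PUNCT']]

# Hand-written descriptions of the Universal POS tags that appear above.
TAG_MEANING = {
    'PROPN': 'proper noun', 'NOUN': 'noun', 'PRON': 'pronoun',
    'VERB': 'verb', 'AUX': 'auxiliary', 'ADJ': 'adjective',
    'ADP': 'adposition', 'DET': 'determiner', 'PUNCT': 'punctuation',
}

def pair_tokens_with_tags(tokens, tags):
    """Return, for every sentence, a list of (word, tag) tuples."""
    return [list(zip(words, word_tags)) for words, word_tags in zip(tokens, tags)]

pairs = pair_tokens_with_tags(token, pos)

# Print the third sentence as a small word / tag / meaning table.
for word, word_tag in pairs[2]:
    print(f"{word:<8}{word_tag:<6}{TAG_MEANING[word_tag]}")
```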

If we want to see the tokens and POS tags alongside each sentence, we can add them to the dataframe as new columns:

df['token'] = token 
df['pos'] = pos

It will result in this dataframe:

Figure: POS and token dataframe

If we want to know the count of a particular POS tag in each sentence, we can use the list count method.

# counting the number of a specific pos tag in each sentence 
# (in the 'pos' col) and adding a new col for it in the df 
df['noun'] = df.apply(lambda x: x['pos'].count('NOUN'), axis=1)
df['verb'] = df.apply(lambda x: x['pos'].count('VERB'), axis=1)
df['adj'] = df.apply(lambda x: x['pos'].count('ADJ'), axis=1)
df['punct'] = df.apply(lambda x: x['pos'].count('PUNCT'), axis=1)

df

The above code uses a lambda function that counts the number of occurrences of a given POS tag in the ‘pos’ column of each row.
We can count other POS tags in the same way, as required.
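
When every tag needs to be counted rather than a chosen few, one possible alternative is collections.Counter from the standard library. The sketch below works on the pos list copied from the earlier output and does not depend on pandas or spaCy; the variable names per_sentence and total are only illustrative.

```python
from collections import Counter

# POS tags per sentence, copied from the output shown earlier.
pos = [['PROPN', 'VERB', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT'],
       ['PRON', 'VERB', 'DET', 'NOUN', 'NOUN', 'ADP', 'PRON', 'NOUN', 'PUNCT'],
       ['DET', 'NOUN', 'AUX', 'ADJ', 'ADP', 'NOUN', 'PUNCT']]

# One Counter per sentence: maps each tag to its number of occurrences.
per_sentence = [Counter(tags) for tags in pos]

# A single Counter over the whole corpus.
total = Counter(tag for tags in pos for tag in tags)

print(per_sentence[1]['NOUN'])   # nouns in the second sentence
print(per_sentence[0]['ADJ'])    # tags that never occur count as 0
print(total.most_common(1))      # the most frequent tag overall
```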

The output for this is:

Figure: POS count dataframe

Conclusion

That’s all! In this tutorial, we learnt about part-of-speech tagging and how to implement it using spaCy. Please feel free to check out more of our Python tutorials on our website!

