BLEU score in Python – Beginner's Overview

Hello, readers! In this article, we will be focusing on the implementation of BLEU score in Python.

So, let us get started! 🙂

What is BLEU score?

In machine learning, deep learning, and natural language processing, we need evaluation metrics that let us judge the text a model produces.

BLEU (Bilingual Evaluation Understudy) score is one such metric. It estimates the quality of machine translation models or systems, and it is widely used across natural language processing applications.

Behind the scenes, BLEU compares the words and word sequences (n-grams) of a candidate sentence against one or more reference sentences and estimates how closely the candidate matches them. The resulting score lies between 0 and 1.
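
To make the comparison step concrete, here is a minimal pure-Python sketch of clipped ("modified") n-gram precision, the quantity BLEU is built on. The helper names ngrams and modified_precision are chosen for this illustration, and the sketch deliberately leaves out the other parts of BLEU (combining several n-gram orders and the brevity penalty).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, candidate, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Highest count of each n-gram across the individual references
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

refs = ['this is moonlight'.split(), 'moonlight it is'.split()]
cand = 'it is cat and moonlight'.split()

print(modified_precision(refs, cand, 1))  # 3 of 5 words match   -> 0.6
print(modified_precision(refs, cand, 2))  # 1 of 4 word pairs match -> 0.25
```

The clipping matters for repeated words: a candidate such as "moonlight moonlight moonlight" would only be credited once for moonlight, because no reference contains it more than once.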

Calculation of BLEU score in Python

To implement the BLEU score, we'll use the NLTK library, which provides the sentence_bleu() function in its nltk.translate.bleu_score module. The function takes a list of tokenized reference sentences and a tokenized candidate sentence, and scores the candidate against the references.

If the candidate matches a reference exactly, the BLEU score is 1, provided the sentence has at least four tokens, because the default settings also compare 4-grams. If nothing matches at all, the score is 0. For a partial match, the BLEU score lies between 0 and 1.
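
As a quick check of the two extremes, the sketch below scores a candidate that is identical to its reference and one that shares no words with it. The sentences are made up for this illustration and are at least four tokens long, so the default 4-gram comparison has something to work with.

```python
from nltk.translate.bleu_score import sentence_bleu

# Illustrative sentences, not taken from the example in this article
reference = ['the cat sat on the mat'.split()]

identical = 'the cat sat on the mat'.split()
unrelated = 'dogs bark loudly at night'.split()

perfect_score = sentence_bleu(reference, identical)
zero_score = sentence_bleu(reference, unrelated)

print('identical ->', perfect_score)
print('unrelated ->', zero_score)
```

The identical sentence scores 1, and the unrelated one scores 0 because not a single word overlaps with the reference.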

Implementing BLEU Score

In the example below:

  1. We import the sentence_bleu() function from NLTK's nltk.translate.bleu_score module.
  2. Next, we build a list of tokenized reference sentences and assign it to ref.
  3. Then we create a tokenized candidate sentence, test, and score it against ref with sentence_bleu().
  4. Although every word of test appears in the references, the score is practically 0. With the default weights, BLEU also counts 3-grams and 4-grams, and this three-word candidate shares no 3-gram with any reference and contains no 4-grams at all.
  5. Next, we create a second candidate, test01, and pass it to the function.
  6. This sentence shares moonlight and a couple of other words with the references, but it also contains words that appear in none of them, so its score is close to 0 as well.

from nltk.translate.bleu_score import sentence_bleu

# Tokenized reference sentences
ref = [
    'this is moonlight'.split(),
    'Look, this is moonlight'.split(),
    'moonlight it is'.split()
]

# First candidate sentence
test = 'it is moonlight'.split()
print('BLEU score for test-> {}'.format(sentence_bleu(ref, test)))

# Second candidate sentence
test01 = 'it is cat and moonlight'.split()
print('BLEU score for test01-> {}'.format(sentence_bleu(ref, test01)))


BLEU score for test-> 1.491668146240062e-154
BLEU score for test01-> 9.283142785759642e-155
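
Both scores above are effectively zero because the default weights average the 1-gram to 4-gram precisions, and neither candidate shares a 3-gram or 4-gram with the references. For short sentences, two common workarounds are to restrict the weights to lower n-gram orders or to apply smoothing. The sketch below tries both with NLTK's SmoothingFunction class; method1 is just one of the available smoothing methods, picked here for illustration rather than as a recommendation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = [
    'this is moonlight'.split(),
    'Look, this is moonlight'.split(),
    'moonlight it is'.split(),
]
test = 'it is moonlight'.split()

# Option 1: average only the 1-gram and 2-gram precisions
bigram_score = sentence_bleu(ref, test, weights=(0.5, 0.5))

# Option 2: keep the default weights but smooth the zero n-gram counts
smoothed_score = sentence_bleu(
    ref, test, smoothing_function=SmoothingFunction().method1
)

print('1- and 2-gram only ->', bigram_score)
print('smoothed default   ->', smoothed_score)
```

With the first option the score comes out as 1.0, since every word and every word pair of the candidate occurs in some reference. The smoothed score lands between 0 and 1; its exact value depends on the smoothing method chosen.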

Implementing N-gram score in Python

By default, sentence_bleu() uses the weights (0.25, 0.25, 0.25, 0.25), which means it combines the 1-gram to 4-gram precisions with equal weight. An N-gram is simply a sequence of N consecutive words that is matched against the reference sentences:

  • 1-gram: 1 word
  • 2-gram: pairs of words
  • 3-gram: triplets, etc

To score a single N-gram order on its own, pass one of the following weights tuples to the sentence_bleu() function:

1-gram: (1, 0, 0, 0)
2-gram: (0, 1, 0, 0)
3-gram: (0, 0, 1, 0)
4-gram: (0, 0, 0, 1)


In the example below, we calculate the 2-gram BLEU score for the candidate sentence test01 against the reference sentences ref by passing the 2-gram weights, i.e. (0, 1, 0, 0), to sentence_bleu().

from nltk.translate.bleu_score import sentence_bleu

# Tokenized reference sentences
ref = [
    'this is moonlight'.split(),
    'Look, this is moonlight'.split(),
    'moonlight it is'.split()
]

test01 = 'it is cat and moonlight'.split()

# weights=(0, 1, 0, 0) scores 2-grams only
print('2-gram:', sentence_bleu(ref, test01, weights=(0, 1, 0, 0)))


2-gram: 0.25
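
The weights used so far give individual N-gram scores. A cumulative score instead spreads the weight over all orders up to N, for example (0.5, 0.5, 0, 0) for a cumulative 2-gram score. The sketch below compares the two for test01; the variable names are only for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu

ref = [
    'this is moonlight'.split(),
    'Look, this is moonlight'.split(),
    'moonlight it is'.split(),
]
test01 = 'it is cat and moonlight'.split()

# Individual scores: all weight on a single n-gram order
individual_1gram = sentence_bleu(ref, test01, weights=(1, 0, 0, 0))
individual_2gram = sentence_bleu(ref, test01, weights=(0, 1, 0, 0))

# Cumulative 2-gram score: weight shared between 1-grams and 2-grams
cumulative_2gram = sentence_bleu(ref, test01, weights=(0.5, 0.5, 0, 0))

print('individual 1-gram:', individual_1gram)
print('individual 2-gram:', individual_2gram)
print('cumulative 2-gram:', cumulative_2gram)
```

Here 3 of the 5 words and 1 of the 4 word pairs match, so the individual scores are 0.6 and 0.25. The cumulative 2-gram score is their geometric mean, roughly 0.39.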


With this, we have come to the end of this topic. Feel free to comment below if you have any questions.

For more such posts related to Python programming, stay tuned with us.

Till then, happy learning! 🙂