Optical Character Recognition (OCR) in Python


In this article, we will learn how to perform Optical Character Recognition using PyTesseract, or python-tesseract. Pytesseract is a wrapper for the Tesseract-OCR Engine. Tesseract is an open-source OCR engine whose development has been sponsored by Google.

There are times when we have text in our images that we need to type out on our computer.

It is very easy for us to perceive what is written in an image, but for a computer, understanding the text inside the image is a really difficult task.

A computer will just perceive an image as an array of pixels.

OCR comes in handy for this task. OCR detects the text content in images and translates the information into encoded text that the computer can easily understand.

In this article, we’ll see how to perform an OCR task with Python.

Implementing Basic Optical Character Recognition in Python

Install the Python wrapper for tesseract using pip.

$ pip install pytesseract

You can refer to this query on Stack Overflow for details about installing the Tesseract binary and getting pytesseract to work.
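
On Windows (and on any system where the Tesseract executable is not on your PATH), you can also point pytesseract at the binary explicitly. Here is a minimal sketch – the path below is only an example and depends on where you installed Tesseract:

import pytesseract

#Tell pytesseract where the Tesseract binary lives
#(example path, adjust to your own installation)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'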

1. Get An Image With Clearly Visible Text

Let’s now look at one sample image and extract text from it.

Sample text image

2. Code to Extract Text From Image

The image above is in JPEG format, and we’ll try to extract the text information from it.

#Importing libraries
import cv2
import pytesseract

#Loading image using OpenCV
img = cv2.imread('sample.jpg')

#Converting to text
text = pytesseract.image_to_string(img)

print(text)

Output:

On the Insert tab, the galleries include items that are designed
to coordinate with the overall look of your document. You can
use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create
pictures, charts, or diagrams, they also coordinate with your
current document look.

After loading the image using OpenCV, we used pytesseract’s image_to_string method, which takes an image as its input argument. This single line of code transforms the text information in the image into an encoded string.
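
As a side note, image_to_string also accepts optional arguments such as lang (which traineddata file to use) and config (extra Tesseract flags). A minimal sketch, assuming English text – the values shown are examples, not something the sample above requires:

#Optional: specify the language and a page segmentation mode
#'--psm 6' tells Tesseract to assume a single uniform block of text
text = pytesseract.image_to_string(img, lang='eng', config='--psm 6')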

However, real-life OCR tasks can be challenging if we don’t preprocess the images, as the accuracy of the conversion is directly affected by the quality of the input image.

Implementing OCR After Preprocessing Using OpenCV

Steps we’ll use to preprocess our image:

  • Convert the image to grayscale – the image eventually needs to become a binary image, so first we convert the colored image to grayscale.
  • Thresholding is used to convert the grayscale image into a binary image: each pixel is compared against a threshold value, and with binary thresholding all pixels above the threshold are turned white while all pixels below it are turned black (an alternative, adaptive thresholding, is sketched right after this list).
  • Invert the binary image using the bitwise_not operation so the text becomes white on a black background.
  • Apply noise reduction techniques such as erosion and dilation.
  • Apply the text extraction method to the preprocessed image.
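
As a small aside (not part of the original pipeline), a single global threshold can struggle with photos that have uneven lighting; OpenCV’s adaptiveThreshold computes a local threshold per neighborhood instead. A minimal sketch, assuming gray_image is the grayscale image produced in the code below and using example values for blockSize and C:

#Alternative to global/Otsu thresholding for unevenly lit images
#blockSize=11 and C=2 are illustrative values and may need tuning
adaptive_bin = cv2.adaptiveThreshold(gray_image, 255,
                                     cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 11, 2)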

1. Find an Image With Clear Text

Let’s implement the above steps in code using the image below:

Sample image for testing optical character recognition

2. Complete Code to Preprocess and Extract Text from Images using Python

We’ll now follow the steps to preprocess the image above and extract the text from it. Optical character recognition works best when the image is clear and readable enough for the recognition engine to pick up cues from.

#Importing libraries
import cv2
import pytesseract
import numpy as np

#Loading image using OpenCV
img = cv2.imread('sample_test.jpg')

#Preprocessing image
#Converting to grayscale
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

#Creating a binary image; with THRESH_OTSU the threshold value passed (130) is ignored
#and Otsu's method selects the threshold automatically
binary_image = cv2.threshold(gray_image, 130, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

#Inverting the image
inverted_bin = cv2.bitwise_not(binary_image)

#Some noise reduction
kernel = np.ones((2,2),np.uint8)
processed_img = cv2.erode(inverted_bin, kernel, iterations = 1)
processed_img = cv2.dilate(processed_img, kernel, iterations = 1)

#Applying image_to_string method
text = pytesseract.image_to_string(processed_img)

print(text)

Output:

On the Insert tab, the galleries include items that are designed
to coordinate with the overall look of your document. You can
use these galleries to insert tables, headers, footers, lists, cover
pages, and other document building blocks. When you create
pictures, charts, or diagrams, they also coordinate with your
current document look,

You can easily change the formatting of selected text in the
documenttext by choosing a look for the selected text from the
Quick Styies gallery on the Home tab. You can also format text
directly by using the other controls on the Home tab. Most
controls offer a choice of using the look from the current theme

or using a tormat that you specify directly.

To change the overall look of your document, choose new
Theme elements on the Page Layout tab. To change the looks
available in the Quick Style gallery, use the Change Current
Quick Style Set command. Both the Themes gallery and the
Quick Styles gallery provide reset commands so that you can

You can learn more about OpenCV and its functions for image transformations in the official OpenCV documentation.
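
As one more illustration of combining pytesseract with OpenCV (a sketch that assumes the same img loaded earlier), the image_to_data method returns word-level bounding boxes and confidences, which you can draw on the image with cv2.rectangle:

#Getting word-level boxes and confidences from Tesseract
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

#Drawing a rectangle around every word detected with reasonable confidence
for i in range(len(data['text'])):
    if float(data['conf'][i]) > 60 and data['text'][i].strip():
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite('boxes.jpg', img)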

Conclusion

This article was all about implementing optical character recognition in Python using the PyTesseract wrapper, along with some preprocessing steps that can help you get better results.

Happy Learning!