OCR- Extract Individual Fields of a Table Image to Excel

OCR stands for optical character recognition and is used to obtain text from image formats. OCR is often used to retrieve data from scanned documents. You must have gone through various applications to obtain text from PDFs and images. All these applications internally use optical character recognition. While the data present in the scanned documents or images is not editable, after using OCR, we obtain data that can be editable and searchable. OCR ofter is also called text recognition.

Refer to this article to get a deep understanding of OCR.

OCR converts data in hand-written or printed texts into human or even machine-readable text. OCR can be hardware and software systems that convert physical documents into editable text.

OCR is used not only in digitizing handwritten old scripts but also helps in converting typewritten documents into digital form. The base technology underlying optical character recognition is computer vision, as in order to implement OCR, we need to acquire and perform image processing, which can be carried out using Computer Vision.

While we are at it, let us also learn about image recognition

Applications of OCR

There are various applications /uses of optical character recognition. Some of them are given below.

Digitization of documents: OCR is primarily used to convert physical documents(hard copies) into digital counterparts(soft copies). This is not only helpful in the banking sector, but it is also used to preserve religious handwritten manuscripts.

Data Entry: OCR can be used to automate manual and time-consuming data entry. Automating the data entry process helps to reduce labor and increase efficiency.

Various Software Applications: OCR often becomes the base for mobile applications such as scanners, which produce editable text from scanned documents or images.

In this article, we are going to take an image of a table with data and extract individual fields in the table to Excel.

OCR Using Pytesseract

Pytesseract or Python-Tesseract is a tool specifically designed to make OCR easy and simple. It is a Python wrapper for Google’s Tesseract OCR. Pytesseract is available in the third-party repository – PyPi.

To use this tool, we need to first install it. Installation can be done as follows.

pip install pytesseract

We also need OpenCV for this task. It can be installed by the command given below.

pip install opencv-python

Now that we have all the necessary libraries and tools let us start with the code!

The image we are going to use here is shown below.

Before we perform OCR, we need to first convert the image into a grayscale and perform preprocessing of the image. The code is given below.

import cv2
import pytesseract
import pandas as pd
img = "/content/excel ss.png"
image = cv2.imread(img)
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, threshold_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
custom_config = r'--oem 3 --psm 6'
text_data = pytesseract.image_to_string(threshold_image, config=custom_config)
rows = text_data.split("\n")
table_data = [row.split("\t") for row in rows]
for row in table_data:
    for i, cell in enumerate(row):
        row[i] = ''.join(char for char in cell if ord(char) < 128)

df = pd.DataFrame(table_data)
df = df.applymap(lambda x: x if x.strip() != "" else pd.NA)
output_excel_file = "sheet.xlsx"
df.to_excel(output_excel_file, index=False, header=False)

Using the first three lines of the code, we import the cv2 library, the py-tesseract wrapper, and the pandas library. The instance img consists of the path to the table image we want to use.

The imread method of the cv2 library is used to load the image, and the gray_image consists of the same image but in grayscaleThis conversion is also done using the BGR2GRAY component of the library.

Next, we need to perform some cleaning. We need to store the image as a binary component, for which we use thresholding. Thresholding is used to convert an image into its binary form. The threshold function takes four parameters. The first parameter should be the input image(grayscaled), the second parameter is the threshold value to be applied for each pixel, the third argument is the maximum value that is assigned to pixel values exceeding the threshold, and finally, the type of thresholding to use. We are here using the THRESH_BINARY and the THRESH_OTSU is supplied as a flag for arbitrary thresholding.

In the next line, we see a custom configuration using the Tesseract wrapper. This line of code allows us to choose the parameters by ourselves.

oem: OEM stands for OCR Engine Modes and determines which engine to use considering the performance and speed. Here we used oem 3

psm: This stands for Page Segmentation Modes. psm is used to decide how we consider the image. We can take the image as a single block of text, a single word, and so on. Here we used psm 6.

Below is the detailed description of both configuration modes I found in Stack Overflow.

In the following line, we are extracting the text data from the image. We are also removing any non-ASCII characters (characters with ASCII value less than 128).

The cleaned table is read as a data frame where the missing values(if any) are handled with the help of pd.NA, which is then saved as an Excel file using df.to_excel. The Excel file is stored in sheet.xlsx

The output is given below.

OCR Using Pytesseract — OCR using Pytesseract

As you can see, we have extracted each individual field of the table into the Excel file.

OCR Using Easyocr

EasyOCR is another Python method that is used to perform OCR in Python. We can install this package by the following command.

pip install easyocr

Let us see how we can use easyocr to solve our task.

import pandas as pd
import easyocr
imgp = "/content/excel ss.png"
reader = easyocr.Reader(['en'])  
results = reader.readtext(imgp)
tabledata = []
for (bbox, text, prob) in results:
    tabledata.append([text])
df = pd.DataFrame(tabledata)
output_file = "easyocr.xlsx"
df.to_excel(output_file, index=False, header=False)

In the first two lines, we are importing the Pandas and EayOCR libraries. Next, we are loading the image. The image is stored in imgp.

We are initializing an instance called reader for the Reader class, which specifies which language the model should recognize(English in this case).

The readtext component is used to identify the text in the image. We are initializing an empty list called tabledata, which stores the text recognized by the readtext. This list is converted into a data frame called df. The Excel file is named easyocr.xlsx .

The method df.to_excel is used to convert the data frame to Excel.

Here is the output.

We have extracted the individual fields row-wise in this example.

Summary

To summarize what we have done, we learned what is Optical character recognition(OCR) and observed some of its widespread applications. It is used in banking to verify cheques and transaction receipts, reduce labor in data entry, digitize transcripts, and so on.

The task at our hand was to extract text from an image of a table with fields and store it in an Excel sheet.

We saw two approaches to this task- Using Pytesseract and EasyOCR.

References

Pytesseract documentation

EasyOCR documentation

OPEN CV image thresholding

Stack Overflow