How to Process Text from PDF Files in Python?

PDFs are a common way to share text. PDF stands for Portable Document Format and uses the .pdf file extension. It was created in the early 1990s by Adobe Systems.

Reading PDF documents with Python can help you automate a wide variety of tasks.

In this tutorial we will learn how to extract text from a PDF file in Python.

Let’s get started.

Reading and Extracting Text from a PDF File in Python

For this tutorial we will create a sample PDF with 2 pages. You can do so in any word processor, such as Microsoft Word or Google Docs, by saving the file as a PDF.

Text on page 1:

Hello World. 
This is a sample PDF with 2 pages. 
This is the first page. 

Text on page 2:

This is the text on Page 2. 

Using PyPDF2 to Extract PDF Text

You can use PyPDF2 to extract text from a PDF. Let’s see how it works.

1. Install the package

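To install PyPDF2 on your system, enter the following command in your terminal. You can read more about the pip package manager. The version is pinned below 3.0 because the PdfFileReader, getPage() and extractText() names used in this section were removed in PyPDF2 3.0.

pip install "pypdf2<3.0"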

2. Import PyPDF2

Open a new Python notebook and start by importing PyPDF2.

import PyPDF2

3. Open the PDF in read-binary mode

Start with opening the PDF in read binary mode using the following line of code:

pdf = open('sample_pdf.pdf', 'rb')

This opens the file and stores the resulting file object in the variable ‘pdf’. The 'rb' mode matters because a PDF is a binary format, not plain text.
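A slightly safer variant of this step, sketched below, opens the file in a with block so that it is closed automatically once we are done with it. The helper name looks_like_pdf is made up for this illustration and is not part of PyPDF2; it only checks that the file begins with the %PDF- marker that PDF files start with.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF marker bytes.

    Hypothetical helper for illustration; it is not part of PyPDF2.
    """
    # 'rb' opens the file in read-binary mode, so we get raw bytes, not text
    with open(path, 'rb') as pdf:
        header = pdf.read(5)
    # The with block has already closed the file at this point
    return header == b'%PDF-'
```

Calling looks_like_pdf('sample_pdf.pdf') before handing the file to a PDF library is a cheap way to catch a wrong path or a mislabeled file early.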

4. Use PyPDF2.PdfFileReader() to read text

Now you can pass the file object to the PdfFileReader class from PyPDF2 to read the file.

pdfReader = PyPDF2.PdfFileReader(pdf)

To get the text from the first page of the PDF, use the following lines of code:

page_one = pdfReader.getPage(0)
print(page_one.extractText())

We get the output as:

Hello World. 
!This is a sample PDF with 2 pages. !This is the first page. !

Here we used the getPage() method to fetch the page as an object; pages are zero-indexed, so getPage(0) returns the first page. Then we used the extractText() method to get the text from the page object.

The text we get back is an ordinary Python string (str).
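Because the result is an ordinary string, the usual string methods apply to it. The sketch below works on a hard-coded copy of the page 2 text rather than on a real PDF, and the helper name count_words is an assumption made for this example.

```python
def count_words(text):
    """Count whitespace-separated words in a piece of extracted text."""
    # str.split() with no arguments splits on any run of whitespace,
    # so trailing spaces and newlines do not produce empty words
    return len(text.split())


# Stand-in for the string returned by extractText() on page 2 of the sample
page_text = "This is the text on Page 2. "

print(count_words(page_text))      # 7
print(page_text.strip().lower())   # this is the text on page 2.
```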

Similarly, to get the second page from the PDF, use:

page_two = pdfReader.getPage(1)
print(page_two.extractText())

We get the output as:

This is the text on Page 2. 
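To read every page rather than one page at a time, a loop helps. The sketch below is written against the newer PyPDF2 interface (version 3.0 and later), where the reader class is PdfReader, pages are exposed as reader.pages, and the extraction method is extract_text(); these differ from the names used in the rest of this section. The function names extract_pages and read_pdf are made up for this example.

```python
def extract_pages(reader):
    """Return a list with the text of every page of a PdfReader-like object."""
    # `or ""` guards against an extraction call that yields None for a page
    return [page.extract_text() or "" for page in reader.pages]


def read_pdf(path):
    """Open `path` and return the text of each page as a list of strings."""
    # Imported here so extract_pages() can be tried without PyPDF2 installed
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.0 or later

    # Extraction happens inside the with block, while the file is still open
    with open(path, 'rb') as pdf:
        return extract_pages(PdfReader(pdf))
```

For the sample file, read_pdf('sample_pdf.pdf') should give back a list with two strings, one per page.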

Complete Code to Read PDF Text using PyPDF2

The complete code from this section is given below:

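import PyPDF2
# Open the PDF in read-binary mode; the with block closes the file afterwards
with open('sample_pdf.pdf', 'rb') as pdf:
    pdfReader = PyPDF2.PdfFileReader(pdf)
    page_one = pdfReader.getPage(0)  # pages are zero-indexed
    print(page_one.extractText())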

Notice that the formatting of the first page is a little off in the output above: the lines run together and stray "!" characters appear between them. PyPDF2's text extraction does not always reconstruct the layout of a page faithfully.

Luckily, Python has a better alternative to PyPDF2. We are going to look at that next.

Using PDFplumber to Extract Text

PDFplumber is another tool that can extract text from a PDF. It is more powerful than PyPDF2 when it comes to text extraction.

1. Install the package

Let’s get started with installing PDFplumber.

pip install pdfplumber

2. Import pdfplumber

Start by importing PDFplumber using the following line of code:

import pdfplumber

3. Use PDFplumber to read PDFs

You can start reading PDFs using PDFplumber with the following piece of code:

with pdfplumber.open("sample_pdf.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

This gets the text from the first page of our PDF. The output is:

Hello World. 

This is a sample PDF with 2 pages. 

This is the first page. 


You can compare this with the output of PyPDF2 and see how PDFplumber is better when it comes to formatting.

PDFplumber also provides options to get other information from the PDF.

For example, you can use .page_number to get the page number.

print(first_page.page_number)

Output:

1
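Combining page_number with extract_text() gives a simple way to label the text of each page. The following is only a sketch: the function names text_by_page and read_with_pdfplumber are invented for this example, and the or "" fallback assumes that extract_text() may return None for a page without text, as some PDFplumber versions do.

```python
def text_by_page(pages):
    """Map each page's 1-based page number to its extracted text."""
    # Assumption: extract_text() may return None for a page with no text
    return {page.page_number: page.extract_text() or "" for page in pages}


def read_with_pdfplumber(path):
    """Return {page_number: text} for every page of the PDF at `path`."""
    # Imported here so text_by_page() can be tried without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return text_by_page(pdf.pages)
```

For the sample file, read_with_pdfplumber("sample_pdf.pdf") should produce a dictionary with the keys 1 and 2.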

To learn more about the methods PDFplumber offers, refer to its official documentation.

Conclusion

This tutorial was about reading text from PDFs. We looked at two different tools and saw how one is better than the other.

Now that you know how to read text from a PDF, you should read our tutorial on tokenization to get started with Natural Language Processing!