PyPDF2: Python Library for PDF File Manipulation


PyPDF2 is a pure-Python library for working with PDF files. It reads and manipulates existing PDF files; we can’t create a new PDF file from scratch using this module.

PyPDF2 Features

Some of the useful features of the PyPDF2 module are:

  • Reading PDF file metadata such as the number of pages, author, creator, and the creation and last-modified times.
  • Extracting the content of a PDF file page by page.
  • Merging multiple PDF files.
  • Rotating PDF pages by an angle.
  • Scaling PDF pages.
  • Extracting images from PDF pages and saving them as images using the Pillow library.

Installing PyPDF2 Module

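We can use pip to install the PyPDF2 module. The examples in this article use the PdfFileReader, PdfFileWriter, and PdfFileMerger classes, which were removed in PyPDF2 3.0, so they need an earlier release.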

$ pip install PyPDF2

PyPDF2 Examples

Let’s look at some examples to work with PDF files using the PyPDF2 module.

1. Extracting PDF Metadata

We can get the number of pages in the PDF file. We can also get information about the PDF author, creator app, and creation dates.

import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    print(f'Number of Pages in PDF File is {pdf_reader.getNumPages()}')
    print(f'PDF Metadata is {pdf_reader.documentInfo}')
    print(f'PDF File Author is {pdf_reader.documentInfo["/Author"]}')
    print(f'PDF File Creator is {pdf_reader.documentInfo["/Creator"]}')

Sample Output:

Number of Pages in PDF File is 2
PDF Metadata is {'/Author': 'Microsoft Office User', '/Creator': 'Microsoft Word', '/CreationDate': "D:20191009091859+00'00'", '/ModDate': "D:20191009091859+00'00'"}
PDF File Author is Microsoft Office User
PDF File Creator is Microsoft Word

Recommended Readings: Python with Statement and Python f-strings

  • The PDF file should be opened in binary mode. That’s why the file opening mode is passed as ‘rb’.
  • The PdfFileReader class is used to read the PDF file.
  • The documentInfo is a dictionary that contains the metadata of the PDF file.
  • We can get the number of pages in the PDF file using the getNumPages() function. An alternative way is to use the numPages attribute.
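The /CreationDate and /ModDate values in the sample output are PDF date strings rather than Python datetime objects. The following is a minimal sketch of converting them using only the standard library. The helper name parse_pdf_date is hypothetical (it is not part of PyPDF2), and the sketch assumes the full D:YYYYMMDDHHmmSS form with an optional UTC offset.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches dates such as "D:20191009091859+00'00'". Only the full
# YYYYMMDDHHmmSS form with an optional offset is handled in this sketch.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it does not match."""
    match = _PDF_DATE.match(value)
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, off_hours, off_minutes = match.groups()[6:]
    tzinfo = None  # stays naive when the string carries no offset
    if sign == "Z":
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(off_hours or 0), minutes=int(off_minutes or 0))
        tzinfo = timezone(offset if sign == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)


# the /CreationDate value from the sample output above
print(parse_pdf_date("D:20191009091859+00'00'"))  # 2019-10-09 09:18:59+00:00
```

With PyPDF2 installed, the same function could be applied to pdf_reader.documentInfo["/CreationDate"].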

2. Extracting Text of PDF Pages

import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # printing first page contents
    pdf_page = pdf_reader.getPage(0)
    print(pdf_page.extractText())

    # reading all the pages content one by one
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        print(pdf_page.extractText())

  • The PdfFileReader getPage(int) method returns a PyPDF2.pdf.PageObject instance.
  • We can call the extractText() method on the page object to get the text content of the page.
  • The extractText() method does not return binary data such as images.
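When the text of the whole document is needed as one string, the page loop above can be wrapped in a small helper. This is a sketch only: extract_all_text is a hypothetical name, and it assumes a reader object that exposes numPages and getPage() as PdfFileReader does in the examples here.

```python
def extract_all_text(pdf_reader, separator="\n"):
    """Join the extracted text of every page, in page order."""
    pages_text = []
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        # a page without extractable text simply contributes an empty string
        pages_text.append(pdf_page.extractText())
    return separator.join(pages_text)


# Usage, assuming PyPDF2 with the pre-3.0 API is installed:
#
#     import PyPDF2
#     with open('Python_Tutorial.pdf', 'rb') as pdf_file:
#         text = extract_all_text(PyPDF2.PdfFileReader(pdf_file))
```

The helper has to be called while the source file is still open, because the pages are read from it on demand.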

3. Rotate PDF File Pages

PyPDF2 supports many manipulations that are done page by page. For example, we can rotate a page clockwise or counter-clockwise by an angle that is a multiple of 90 degrees.

import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()

    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        pdf_page.rotateClockwise(90)  # rotateCounterClockwise()

        pdf_writer.addPage(pdf_page)

    with open('Python_Tutorial_rotated.pdf', 'wb') as pdf_file_rotated:
        pdf_writer.write(pdf_file_rotated)

  • The PdfFileWriter is used to write a new PDF file from the pages of the source PDF.
  • The rotateClockwise(90) method rotates the page clockwise by 90 degrees.
  • We are adding the rotated pages to the PdfFileWriter instance.
  • Finally, the write() method of the PdfFileWriter is used to produce the rotated PDF file.

The PdfFileWriter writes PDF files built from the pages of existing source PDF files. We can’t use it to create a PDF file from text data.
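Because rotation is applied per page, it is also possible to rotate only some pages and copy the rest unchanged. The sketch below assumes reader and writer objects with the same methods used above (numPages, getPage(), rotateClockwise(), rotateCounterClockwise(), addPage()). The function name rotate_selected_pages and the convention that a negative angle means counter-clockwise are assumptions made for this illustration.

```python
def rotate_selected_pages(pdf_reader, pdf_writer, angle, page_numbers):
    """Copy every page to pdf_writer, rotating only the pages in page_numbers."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    selected = set(page_numbers)
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        if page_num in selected:
            if angle >= 0:
                pdf_page.rotateClockwise(angle)
            else:
                # negative angles are treated as counter-clockwise rotation
                pdf_page.rotateCounterClockwise(-angle)
        pdf_writer.addPage(pdf_page)


# Usage, assuming PyPDF2 with the pre-3.0 API is installed:
#
#     rotate_selected_pages(pdf_reader, pdf_writer, 90, [0])
#     with open('Python_Tutorial_rotated.pdf', 'wb') as out:
#         pdf_writer.write(out)
```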

4. Merge PDF Files

import PyPDF2

pdf_merger = PyPDF2.PdfFileMerger()
pdf_files_list = ['Python_Tutorial.pdf', 'Python_Tutorial_rotated.pdf']

for pdf_file_name in pdf_files_list:
    with open(pdf_file_name, 'rb') as pdf_file:
        pdf_merger.append(pdf_file)

with open('Python_Tutorial_merged.pdf', 'wb') as pdf_file_merged:
    pdf_merger.write(pdf_file_merged)

The above code looks correct, but it produced an empty PDF file. The reason is that the source PDF files were closed before the write that creates the merged PDF file actually happened.

This is a known issue in the PyPDF2 version used for this article. You can read about it in this GitHub issue.

An alternative approach is to use the contextlib module to keep the source files open until the write operation is done.

import contextlib
import PyPDF2

pdf_files_list = ['Python_Tutorial.pdf', 'Python_Tutorial_rotated.pdf']

with contextlib.ExitStack() as stack:
    pdf_merger = PyPDF2.PdfFileMerger()
    files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdf_files_list]
    for f in files:
        pdf_merger.append(f)
    with open('Python_Tutorial_merged_contextlib.pdf', 'wb') as f:
        pdf_merger.write(f)

You can read more about it in this Stack Overflow question.
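To see why ExitStack helps, here is a small standard-library sketch that does not involve PyPDF2. It uses two stand-in files (not real PDFs) to show that every file registered with the stack stays open until the with block ends, so a read that happens late in the block still succeeds. The file names and contents are made up for the demonstration.

```python
import contextlib
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # create two small stand-in files (not real PDFs)
    paths = []
    for name in ("a.bin", "b.bin"):
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as f:
            f.write(name.encode())
        paths.append(path)

    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(p, "rb")) for p in paths]
        open_inside = [not f.closed for f in files]
        # a late read still works here because nothing has been closed yet
        combined = b"".join(f.read() for f in files)

    # leaving the ExitStack block closes every registered file
    closed_after = [f.closed for f in files]

print(open_inside)   # [True, True]
print(combined)      # b'a.binb.bin'
print(closed_after)  # [True, True]
```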

5. Split a PDF File into Single-Page Files

import PyPDF2

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    for i in range(pdf_reader.numPages):
        pdf_writer = PyPDF2.PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(i))
        output_file_name = f'Python_Tutorial_{i}.pdf'
        with open(output_file_name, 'wb') as output_file:
            pdf_writer.write(output_file)

The Python_Tutorial.pdf file has 2 pages, so the output files are named Python_Tutorial_0.pdf and Python_Tutorial_1.pdf.
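The same loop can be adapted to split a document into chunks of several pages instead of single pages. The sketch below covers only the page arithmetic. The page_ranges name is hypothetical, and each (start, stop) pair is meant to drive one PdfFileWriter in the way the loop above drives one writer per page.

```python
def page_ranges(num_pages, chunk_size):
    """Yield (start, stop) page index pairs; stop is exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, num_pages, chunk_size):
        yield start, min(start + chunk_size, num_pages)


print(list(page_ranges(5, 2)))  # [(0, 2), (2, 4), (4, 5)]

# Usage, assuming PyPDF2 with the pre-3.0 API is installed:
#
#     for start, stop in page_ranges(pdf_reader.numPages, 2):
#         pdf_writer = PyPDF2.PdfFileWriter()
#         for i in range(start, stop):
#             pdf_writer.addPage(pdf_reader.getPage(i))
#         with open(f'Python_Tutorial_{start}_{stop - 1}.pdf', 'wb') as out:
#             pdf_writer.write(out)
```

With chunk_size set to 1, the ranges reduce to one page per output file, which matches the program above.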

6. Extracting Images from PDF Files

We can use PyPDF2 along with Pillow (the maintained fork of the Python Imaging Library) to extract images from the PDF pages and save them as image files.

First of all, you will have to install the Pillow module using the following command.

$ pip install Pillow

Here is a simple program to extract images from the first page of the PDF file. We can easily extend it further to extract all the images from the PDF file.

import PyPDF2
from PIL import Image

with open('Python_Tutorial.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # extracting images from the 1st page
    page0 = pdf_reader.getPage(0)

    if '/XObject' in page0['/Resources']:
        xObject = page0['/Resources']['/XObject'].getObject()

        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj].getData()
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                else:
                    mode = "P"

                if '/Filter' in xObject[obj]:
                    if xObject[obj]['/Filter'] == '/FlateDecode':
                        img = Image.frombytes(mode, size, data)
                        img.save(obj[1:] + ".png")
                    elif xObject[obj]['/Filter'] == '/DCTDecode':
                        with open(obj[1:] + ".jpg", "wb") as img:
                            img.write(data)
                    elif xObject[obj]['/Filter'] == '/JPXDecode':
                        with open(obj[1:] + ".jp2", "wb") as img:
                            img.write(data)
                    elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                        with open(obj[1:] + ".tiff", "wb") as img:
                            img.write(data)
                else:
                    img = Image.frombytes(mode, size, data)
                    img.save(obj[1:] + ".png")
    else:
        print("No image found.")

My sample PDF file has a PNG image on the first page and the program saved it with an “image20.png” filename.
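The branching on /Filter in the program above can be summarized as a lookup from filter name to output file extension. The sketch below mirrors those branches and nothing more. The names RAW_IMAGE_EXTENSIONS and image_extension are hypothetical, and the function returns None for any filter the program above does not handle.

```python
# filters whose stream data the program above writes to disk unchanged
RAW_IMAGE_EXTENSIONS = {
    "/DCTDecode": ".jpg",
    "/JPXDecode": ".jp2",
    "/CCITTFaxDecode": ".tiff",
}


def image_extension(filter_name):
    """Return the extension used for a /Filter value, or None if the image is skipped.

    Pass None when the image object has no /Filter entry.
    """
    if filter_name is None or filter_name == "/FlateDecode":
        # these images are rebuilt with Pillow and saved as PNG
        return ".png"
    return RAW_IMAGE_EXTENSIONS.get(filter_name)


print(image_extension("/DCTDecode"))  # .jpg
print(image_extension(None))          # .png
```

Keeping the mapping in one place makes it easier to see which filters are covered when extending the program to every page of the PDF file.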
