Extracting Text from PDF Files Using Python

This tutorial shows you how to extract text from a PDF file using Python. PDF is a popular document format because it keeps the same formatting regardless of the device used to view it. However, extracting text from PDFs can be challenging. Fortunately, Python makes this task much easier.

Before we proceed, let's take a brief look at what a PDF is.

What is a PDF?

PDF (Portable Document Format) is a file format developed by Adobe that can be used to present and exchange documents reliably, independent of software, hardware, or operating system. PDFs can contain links and buttons, form fields, audio, video, and business logic.

Extracting Text from PDFs in Python

Several Python libraries can handle PDFs. For this tutorial, we will use PyPDF2, which is known for its simple API.

Here is a straightforward Python program to extract text from a PDF:

Python Program to Extract Text from a PDF

import PyPDF2

# Open the PDF file in read-binary mode; the with block closes it automatically
with open('sample.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the number of pages in the PDF
    num_pages = len(pdf_reader.pages)

    # Initialize an empty string to store the text
    pdf_text = ""

    # Loop through each page and extract the text
    for page_number in range(num_pages):
        page_obj = pdf_reader.pages[page_number]
        pdf_text += page_obj.extract_text()

# Print the extracted text
print(pdf_text)

In the above program, we first import the necessary module, PyPDF2. We open the PDF file in read-binary mode ('rb') inside a with block and create a PDF reader object.

Then, we get the number of pages in the PDF by taking the length of the reader's pages list. We initialize an empty string, pdf_text, to store the extracted text.

We then loop through each page in the PDF, extract its text using the extract_text() method, and append it to our pdf_text string.

Finally, the with block closes the PDF file for us, and we print the extracted text.

Note that older PyPDF2 releases used the names PdfFileReader, numPages, getPage() and extractText() for the same steps. Those names no longer work in PyPDF2 3.0 and later.

Program Output:

The output of this program is the text content of 'sample.pdf', printed to the console.

Note that PyPDF2 may not extract text as expected if the PDF is image-based (for example, a scanned document) or uses a complex layout. In such cases, an OCR (Optical Character Recognition) engine such as Tesseract or a PDF processing library such as PDFMiner may be more helpful.
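One rough way to spot image-based pages is to check which pages return no text at all. The sketch below is a heuristic, not a PyPDF2 feature: find_pages_without_text is a hypothetical helper name, and it only assumes that each page object offers an extract_text() method, as page objects in recent PyPDF2 releases do.

```python
def find_pages_without_text(pages):
    """Return the 1-based numbers of pages whose extracted text is empty.

    `pages` is any iterable of objects with an extract_text() method,
    for example the `pages` attribute of a PyPDF2.PdfReader.
    (Hypothetical helper, not part of PyPDF2.)
    """
    empty_pages = []
    for page_number, page in enumerate(pages, start=1):
        # `or ""` guards against a missing value before stripping whitespace
        text = page.extract_text() or ""
        if not text.strip():
            empty_pages.append(page_number)
    return empty_pages
```

With a real file, you could pass pdf_reader.pages from the program above to this helper. Pages it reports are candidates for OCR, although a page that is simply blank will be reported too.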

This is a basic tutorial on extracting text from PDFs with Python. There are more advanced use cases depending on your needs. For instance, you can modify this program to process multiple PDF files in a directory or extract specific pages from a PDF.
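The two variations mentioned above could look roughly like this. It is a minimal sketch under some assumptions: the function names extract_pdf_text and extract_directory are made up for illustration, page numbers are treated as 1-based, and only files ending in .pdf directly inside the given directory are processed.

```python
from pathlib import Path


def extract_pdf_text(pdf_path, page_numbers=None):
    """Extract text from one PDF, optionally from selected 1-based pages.

    Hypothetical helper; requires PyPDF2 to be installed.
    """
    import PyPDF2  # imported lazily so the module loads without PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        total = len(reader.pages)
        if page_numbers is None:
            # No selection given: use every page
            page_numbers = range(1, total + 1)
        parts = []
        for number in page_numbers:
            if not 1 <= number <= total:
                raise ValueError(
                    f"{pdf_path}: page {number} is out of range 1..{total}"
                )
            # Convert the 1-based page number to a 0-based index
            parts.append(reader.pages[number - 1].extract_text() or "")
    return "\n".join(parts)


def extract_directory(directory, extractor=extract_pdf_text):
    """Return {file name: extracted text} for every .pdf file in `directory`.

    `extractor` is any callable that takes a path and returns text;
    it defaults to extract_pdf_text above.
    """
    results = {}
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        results[pdf_path.name] = extractor(pdf_path)
    return results
```

For example, extract_pdf_text('sample.pdf', page_numbers=[1, 3]) would return the text of the first and third pages, and extract_directory('reports') would process every PDF in a folder named reports (a placeholder name). Passing the extractor as a parameter keeps the directory loop independent of any one PDF library.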

Install PyPDF2

The 'PyPDF2' Python library can be installed using pip, Python's package manager. The following command installs the package:

pip install PyPDF2

I hope you found this tutorial helpful in learning how to extract text from PDFs using Python. Feel free to explore and experiment more with the PyPDF2 library according to your project requirements.
