This tutorial will teach you how to extract text from a PDF using Python. PDF is a popular document format for storing information because it maintains the same formatting regardless of the device used. However, extracting text from PDFs can be challenging. But fear not; Python can make this task much easier.
Before we proceed, let's take a brief look at what a PDF is.
What is a PDF?
PDF (Portable Document Format) is a file format developed by Adobe that can be used to present and exchange documents reliably, independent of software, hardware, or operating system. PDFs can contain links and buttons, form fields, audio, video, and business logic.
Extracting Text from PDFs in Python
Python has several libraries for working with PDFs. For this tutorial, we will use PyPDF2, known for its simplicity and its small, easy-to-learn API.
Here is a straightforward Python program to extract text from a PDF:
Python Program to Extract Text from a PDF
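# Import the PyPDF2 library
# (PdfFileReader, numPages, getPage and extractText are the names used
# before PyPDF2 3.0; version 3.0 removed them in favor of PdfReader,
# len(reader.pages), reader.pages[i] and extract_text())
import PyPDF2
# Open the PDF file in read-binary mode
pdf_file = open('sample.pdf', 'rb')
# Create a PDF file reader object
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
# Get the number of pages in the PDF
num_pages = pdf_reader.numPages
# Initialize an empty string to store the text
pdf_text = ""
# Loop through each page and extract the text
for page in range(num_pages):
    page_obj = pdf_reader.getPage(page)
    pdf_text += page_obj.extractText()
# Close the PDF file
pdf_file.close()
# Print the extracted text
print(pdf_text)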
In the above program, we first import the necessary module, PyPDF2. We open the PDF file in read-binary mode ('rb') and create a PDF reader object.
Then, we get the number of pages in the PDF using the numPages attribute. We initialize an empty string, pdf_text, to store the extracted text.
We then loop through each page in the PDF, extract the text from each page using the extractText() method, and append it to our pdf_text string.
Finally, we close the PDF file and print out the extracted text.
The output of this program is the text content of 'sample.pdf', printed to the console.
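The program above relies on the method names PdfFileReader, numPages, getPage() and extractText(), which PyPDF2 removed in version 3.0. As a rough sketch of the same steps with the newer names (PdfReader, reader.pages and extract_text()), something like the following should work on PyPDF2 3.0 or later. The function name extract_pdf_text is our own choice for illustration, not part of the library.

```python
def extract_pdf_text(path):
    """Return the text of every page in the PDF at `path`, joined by newlines."""
    # Imported inside the function so the definition itself does not
    # require PyPDF2 to be installed.
    from PyPDF2 import PdfReader  # replaces PdfFileReader

    # The with-block closes the file even if extraction raises an error.
    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # reader.pages replaces numPages/getPage();
        # extract_text() replaces extractText().
        # "or ''" guards against a page that yields no text at all.
        page_texts = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(page_texts)


# Example call (assumes sample.pdf exists in the working directory):
# print(extract_pdf_text("sample.pdf"))
```

Collecting the page texts in a list and joining them once avoids repeated string concatenation, and the with-block removes the need for an explicit close() call.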
PyPDF2 may not always extract text as expected if the PDF is image-based or contains complex layouts. In such cases, OCR (Optical Character Recognition) libraries like Tesseract or PDF processing libraries like PDFMiner could be more helpful.
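One simple way to notice an image-based PDF is to check whether the extracted page texts come back empty. The sketch below is a heuristic under that assumption, not a reliable detector: pages_needing_ocr and extract_with_pdfminer are hypothetical helper names, and the second helper assumes the pdfminer.six package is installed (pip install pdfminer.six), whose pdfminer.high_level module provides an extract_text function.

```python
def pages_needing_ocr(page_texts, min_chars=1):
    """Return the 0-based indices of pages whose extracted text is (nearly) empty.

    Such pages are likely scanned images and candidates for OCR.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


def extract_with_pdfminer(path):
    """Extract all text from the PDF at `path` using pdfminer.six instead of PyPDF2."""
    # Imported lazily so the heuristic above works without pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

If pages_needing_ocr reports most pages of a document, the file probably contains scanned images, and an OCR tool such as Tesseract is the more promising route. If only the layout is garbled, trying extract_with_pdfminer first is cheaper.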
This is a basic tutorial on extracting text from PDFs with Python. There are more advanced use cases depending on your needs. For instance, you can modify this program to process multiple PDF files in a directory or extract specific pages from a PDF.
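Both extensions mentioned here can be sketched in a few lines. The helper names below (select_pages, extract_directory, pypdf2_page_texts) are made up for this example, and pypdf2_page_texts assumes PyPDF2 3.0 or later; any function that takes a path and returns a list of page texts could be passed in its place.

```python
from pathlib import Path


def select_pages(page_texts, page_numbers):
    """Return the text of the requested 1-based page numbers, skipping out-of-range ones."""
    return [page_texts[n - 1] for n in page_numbers if 1 <= n <= len(page_texts)]


def extract_directory(directory, extract_pages):
    """Map each PDF file name in `directory` to the page texts returned by `extract_pages`."""
    results = {}
    # sorted() keeps the processing order predictable across runs.
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        results[pdf_path.name] = extract_pages(pdf_path)
    return results


def pypdf2_page_texts(path):
    """Return a list with one text string per page (assumes PyPDF2 3.0 or later)."""
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:
        return [page.extract_text() or "" for page in PdfReader(pdf_file).pages]


# Example (assumes a folder named "reports" containing PDF files):
# all_texts = extract_directory("reports", pypdf2_page_texts)
# first_and_third = select_pages(all_texts["sample.pdf"], [1, 3])
```

Passing the extraction function as an argument keeps the directory loop independent of any one PDF library, so switching to a different extractor later does not require changing it.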
The PyPDF2 library can be installed using pip, Python's package manager, with the following command:
pip install PyPDF2
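Because the method names available differ between PyPDF2 releases, it can help to confirm which version pip actually installed. A minimal sketch using the standard library's importlib.metadata (Python 3.8 or later) follows; installed_version and legacy_names_removed are illustrative names, not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name="PyPDF2"):
    """Return the installed version string of `package_name`, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def legacy_names_removed(version_string):
    """Return True for version 3 or higher, where PdfFileReader and friends were removed."""
    major = int(version_string.split(".")[0])
    return major >= 3


# Example:
# version = installed_version()
# if version is None:
#     print("PyPDF2 is not installed")
# elif legacy_names_removed(version):
#     print("Use PdfReader, reader.pages and extract_text()")
```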
I hope you found this tutorial helpful in learning how to extract text from PDFs using Python. Feel free to explore and experiment more with the PyPDF2 library according to your project requirements.