In this example we will use the Python programming language to extract text from a PDF file.
We are going to use the PyPDF2 package to work with the PDF file.
There are a few advantages to using PDF files:
- PDF format allows professionals to edit, share, collaborate and ensure the security of the content within digital documents.
- Reports are mostly generated in PDF format because a PDF file is meant to be read rather than edited, and a digitally signed PDF cannot be altered without invalidating its signature.
- PDF files are compatible across multiple platforms.
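Prerequisites: Python 3.8.3, PyPDF2 (pip install PyPDF2). The code in this example uses the PyPDF2 1.x names (PdfFileReader, numPages, getPage, extractText); PyPDF2 3.0 and later replace them with PdfReader, len(reader.pages), reader.pages[i] and extract_text().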
Extract Text from PDF
First, we import the required library, PyPDF2, and open the PDF file in binary read mode.
We get the number of pages in the PDF file, then loop over the pages, extract the text of each one, and append it to a list.
Finally, we join the list into a single string and print the extracted text to the console.
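The page loop described above can be tried in isolation, without a PDF file. The following is only a sketch: FakePage, FakeReader and collect_page_text are hypothetical names made up for illustration, and the two fake classes mimic only the three PyPDF2 1.x members used in this example (numPages, getPage, extractText). Unlike a bare "except: pass", the helper also records which pages failed, so skipped pages do not go unnoticed.

```python
class FakePage:
    """Hypothetical stand-in for a PyPDF2 page; returns fixed text."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        # A text of None simulates a page that cannot be decoded
        if self._text is None:
            raise ValueError("page could not be decoded")
        return self._text


class FakeReader:
    """Hypothetical stand-in exposing the PyPDF2 1.x members used here."""

    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, index):
        return self._pages[index]


def collect_page_text(reader):
    """Return (texts, failed): the text of each readable page and the
    indices of the pages whose extraction raised an exception."""
    texts, failed = [], []
    for i in range(reader.numPages):
        try:
            texts.append(reader.getPage(i).extractText())
        except Exception:
            failed.append(i)
    return texts, failed


reader = FakeReader(["first page", None, "third page"])
texts, failed = collect_page_text(reader)
print(texts)   # text of the pages that could be read
print(failed)  # indices of the pages that were skipped
```

Because the helper only relies on those three members, it should accept a real PyPDF2 1.x reader object in place of FakeReader.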
# Import the PDF reader library
import PyPDF2

# Open the file in binary mode; the with block closes it afterwards
with open('simple.pdf', 'rb') as pdf_File:
    # Create the PDF reader object
    pdf_Reader = PyPDF2.PdfFileReader(pdf_File)
    count = pdf_Reader.numPages  # number of pages in the PDF
    TextList = []

    # Extract the text from each page of the PDF file
    for i in range(count):
        try:
            page = pdf_Reader.getPage(i)
            TextList.append(page.extractText())
        except Exception:
            # Skip pages whose text cannot be extracted
            pass

# Join the text of all pages into one string, separated by spaces
TextString = " ".join(TextList)
print(TextString)
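Joining the pages with a space does not remove line breaks that are already inside the text of a page. If a true single-line result is wanted, one possible approach is to collapse all whitespace after joining. The sketch below uses only built-in string methods; the function name to_single_line and the sample strings are made up for illustration.

```python
def to_single_line(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Sample page texts standing in for the contents of TextList
pages = ["Title\nof the  report\n", "  Second page\ttext "]
single_line = to_single_line(" ".join(pages))
print(single_line)  # prints: Title of the report Second page text
```

In the program above, this would mean printing to_single_line(TextString) instead of TextString.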
Testing the Program
Running the extraction program on simple.pdf prints the text of the file to the console.
You can download the sample PDF file used in this example from the source code section below.
Thanks for reading.