This tutorial shows how to read a Word file using Python. Word is widely used for documentation. This tutorial also shows how to install the docx and nltk modules on the Windows operating system when they are not already available in Python.
The docx module is needed to read a Word (docx) file, and nltk is used to split the extracted text into sentences. Word itself is used in many areas, some of which are given below:
You can create all types of official documents in Microsoft Word.
You can create a lecture script using text, WordArt, shapes, colors, and images.
You can create a birthday card or invitation card in Microsoft Word using pre-defined templates or the functions of the Insert and Format menus.
You can highlight basic and advanced knowledge of MS Word as a skill on your resume for a job interview.
You can create notes and assignments in MS Word.
You can create and print a book using MS Word by creating a cover page, content, headers and footers, image adjustments, text alignment, text highlighting, etc.
You can start your business online or offline, which requires creating documents for official work.
You can use Microsoft Word to collaborate with your team while working on the same project and document.
What’s more, this software is widely used in many different application fields all over the world and it also applies to data science.
We have seen various operations on Word files using the Apache POI API in Java, which requires a fair amount of code to read from or write to Word files.
Reading a Word file using Python, by contrast, is very easy and takes only a few lines of code. We will use a sample Word file here.
You may also find a sample Word file through a Google search and give it a try.
Let’s move on to the example…
Have Python installed on Windows (or Unix)
Python version and Packages
Here I am using Python version 3.6.5
Packages: docx, nltk
Preparing your workspace
Preparing your workspace is one of the first things that you can do to make sure that you start off well. The first step is to check your working directory.
When you are working in the Python terminal, you first need to navigate to the directory where your file is located and then start Python. In other words, make sure that your file is located in the directory you want to work from.
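The working directory check described above can also be done from inside Python. The following is a minimal sketch, not part of the original script; the helper name find_document is hypothetical, and sample.docx is the file name used later in this tutorial.

```python
from pathlib import Path


def find_document(filename, directory=None):
    """Return the absolute path of filename inside directory, or None if it is missing.

    find_document is a hypothetical helper used only for illustration.
    When directory is None, the current working directory is searched.
    """
    base = Path(directory) if directory is not None else Path.cwd()
    candidate = base / filename
    return candidate.resolve() if candidate.is_file() else None


print("Current working directory:", Path.cwd())

path = find_document("sample.docx")
if path is None:
    print("sample.docx was not found here; change directory or pass a full path")
else:
    print("Found:", path)
```

If the file is reported as missing, either start Python from the directory that holds the file or pass the full path of the file to the reading function.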
Check Required Modules
Check for the docx and nltk modules in the Python terminal. Type the commands shown below to check for the docx and nltk packages. If you do not get any error message, the modules exist; otherwise you have to install the missing module.
>>> import docx
>>> import nltk
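The same check can be scripted so that it reports every missing module at once instead of stopping at the first import error. This is a small sketch using the standard library function importlib.util.find_spec; the helper name missing_modules is my own and not part of any library.

```python
import importlib.util


def missing_modules(names):
    """Return the names from the given list that cannot be imported by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["docx", "nltk"])
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("docx and nltk are both available")
```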
If you do not have the docx and nltk modules available, follow the steps below to install them on the Windows operating system.
1. Make sure you open the cmd prompt in administrator mode.
2. Execute the command below to install the docx module.
Now we will see how to install the nltk module.
1. Execute the command below to install the nltk module. Make sure you open the cmd prompt in administrator mode.
2. Installing nltk as shown above is not enough; you also need to download the required packages. Download them using the command below.
3. A popup window will now open for downloading the required packages.
4. Once the required packages are downloaded, you should see the following screen.
You are done installing nltk.
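The installation steps above can be sketched from Python as well. This is only an illustration under stated assumptions: I assume the docx module is the one published on PyPI under the name python-docx, and the helper pip_install_command is hypothetical. The sketch only prints the commands; the lines that would actually install or download anything are left commented out.

```python
import sys


def pip_install_command(package):
    """Build the pip command that installs package for the running interpreter."""
    return [sys.executable, "-m", "pip", "install", package]


# Assumption: the docx module is distributed on PyPI as "python-docx".
for package in ("python-docx", "nltk"):
    print(" ".join(pip_install_command(package)))

# To run an installation from Python instead of the cmd prompt:
# import subprocess
# subprocess.check_call(pip_install_command("nltk"))

# To fetch the sentence tokenizer data without the popup window:
# import nltk
# nltk.download("punkt")  # recent NLTK releases may ask for "punkt_tab" instead
```

Building the command from sys.executable makes pip install into the same Python installation that will later run the script, which helps when several Python versions are installed side by side.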
Now let’s move on to the example of reading a Word file using Python.
In the image below you can see that I have opened a cmd prompt and navigated to the directory where I have put the Word file that has to be read.
We will read the Word file below using the Python programming language. We will read the whole content of the Word file and display it in the Python console. You may instead read the Word file content and process it further for your business needs using Python.
The above Word file should be put into the C:\py_scripts directory, where we will also put the Python script that reads it.
Now create a Python script read_word.py under C:\py_scripts for reading the above Word file. Here py is the extension of the Python file.
In the Python script below, notice how we import the docx and nltk modules.
The Python script below shows how to read a Word file using Python.
import docx
# NLTK is used only for sentence tokenizing
from nltk.tokenize import sent_tokenize


# Extract text from DOCX
def getDocxContent(filename):
    doc = docx.Document(filename)
    # Join paragraphs with newlines so that the last sentence of one
    # paragraph does not run into the first sentence of the next
    return "\n".join(para.text for para in doc.paragraphs)


resume = getDocxContent("sample.docx")

# Split the text into sentences and print them separated by blank lines
sentences = sent_tokenize(resume)
for sentence in sentences:
    print(sentence)
    print("\n")
When you execute the above Python script, then you should see the following output in the console.
Here is the sample file.
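If the docx module cannot be installed, the paragraph text can still be extracted with the standard library alone, because a docx file is a ZIP archive whose main text lives in word/document.xml. The sketch below is an alternative technique and not the approach of the script above; the function name read_docx_paragraphs is my own, and it ignores headers, footers, and formatting.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(path):
    """Return the text of each paragraph in a docx file as a list of strings."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of one paragraph is spread over one or more <w:t> elements
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    # sample.docx is the file name used in this tutorial
    if os.path.exists("sample.docx"):
        for line in read_docx_paragraphs("sample.docx"):
            print(line)
```

The returned list can be joined with newlines and passed to sent_tokenize in the same way as the text returned by getDocxContent.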
I hope you now understand how to read a Word file using Python.
Thanks for reading.