Extract text and images from doc or docx file with Python

One of the best things about Python is its ability to handle platform-specific tasks. On Windows in particular, whenever a task needs access to platform facilities, Python is my first choice, and it has yet to disappoint me. It can call the Win32 API and invoke COM objects, which is indispensable when you want to automate a workflow to gain productivity.

Suppose you are given a doc or docx file and need to post its content to a CMS. You can't just copy and paste, especially not the images; you have to save each image, upload it to the server, and embed it in the content. This is a cumbersome process. It would be much easier if we could extract the text and the images and store them separately.

The following code does that for us:

 
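import os
import shutil
import zipfile

import docx  # provided by the python-docx package
from win32com.client import Dispatch  # provided by the pywin32 package

def doc2docx(path):
    # Word resolves relative paths against its own working directory,
    # so hand it an absolute path.
    path = os.path.abspath(path)
    word = Dispatch('word.application')
    word.DisplayAlerts = 0
    word.Visible = 0
    try:
        doc = word.Documents.Open(path)
        doc.SaveAs(path + "x", 12)  # 12 is the docx file format code
        doc.Close()
    finally:
        word.Quit()  # always shut Word down, even if the conversion fails

def extracttext(docxpath):
    doc = docx.Document(docxpath)
    # Append one line per paragraph; "\n" becomes the platform line ending.
    with open("f:/tempbuffer.txt", "a", encoding="utf-8") as fp:
        for p in doc.paragraphs:
            fp.write(p.text + "\n")

def extractimgs(docxpath, dstpath):
    # Prefix each copied image with the docx file name to avoid collisions.
    prefix = os.path.basename(docxpath)
    with zipfile.ZipFile(docxpath) as doc:
        for info in doc.infolist():
            if info.filename.lower().endswith((".png", ".jpg", ".jpeg", ".gif")):
                # extract() keeps the archive's folder layout under dstpath
                # and returns the path it wrote to.
                extracted = doc.extract(info, dstpath)
                name = os.path.basename(info.filename)
                shutil.copy(extracted, os.path.join(dstpath, prefix + name))

docpath = "my.doc"
docxpath = docpath + "x"
doc2docx(docpath)
extracttext(docxpath)
extractimgs(docxpath, "f:\\imgs")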
 

The docx format is just a zip file, which is easy to work with, so if we are given a doc file the first step is to convert it to docx. You can open the file in Word and click Save As to do it manually, but it is more convenient to do it programmatically. With the win32com module (part of the pywin32 package), Python can invoke the same "Save As" feature through a COM object. You create a COM object by calling win32com.client.Dispatch with a ProgID or CLSID as the parameter; to create a Word object, pass the ProgID word.application. With that object we can open a doc file and save it in docx format. The second argument to SaveAs, 12, is Word's file format code for docx.

How to handle line endings in Python

Programmers deal with text every day, whatever the programming language. Two issues cause a lot of struggle and fretting: one is encoding, the other is line endings. Here we look at how Python handles line endings.

First, a recap of the basics. \r is carriage return (CR), ASCII code 0x0d; \n is newline, also called line feed (LF), ASCII code 0x0a. On a typewriter, a carriage return moves horizontally back to the beginning of the current line, while a line feed moves vertically to the next line. Windows uses \r\n (0x0d 0x0a) as its line ending, Unix (including modern macOS) uses \n, and classic Mac OS used \r.
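As a quick check of these basics, here is a minimal sketch that prints the two character codes and guesses which convention a chunk of bytes uses. The function name detect_line_ending is made up for this illustration, and it only looks for the first convention it can find.

```python
CR = "\r"  # carriage return
LF = "\n"  # line feed

print(hex(ord(CR)), hex(ord(LF)))  # 0xd 0xa

def detect_line_ending(data: bytes) -> str:
    """Hypothetical helper: name the line ending convention found in data."""
    if b"\r\n" in data:
        return "CRLF"  # Windows
    if b"\r" in data:
        return "CR"    # classic Mac OS
    if b"\n" in data:
        return "LF"    # Unix
    return "none"
```

The CRLF test has to come first, because a Windows file also contains bare \r and \n bytes when you look at them one at a time.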

Lines read from a text file normally end with a line ending, but when you read a format like docx you get paragraphs and their text. Perhaps surprisingly, paragraphs and text have nothing to do with line endings: CR and LF are not part of the text. If you loop over all the paragraphs and print them to the console, the output resembles the structure of the document, including the images, which show up as empty paragraphs. But that is an illusion, because each call to print automatically appends a newline. Write the same paragraphs to a file and they are all crammed into a single line of text.
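The difference between print and write is easy to see without a docx file. In this sketch the list paragraphs is a made-up stand-in for the p.text values that python-docx returns, with an empty string playing the role of an image-only paragraph, and an in-memory io.StringIO buffer replaces the real output file.

```python
import io

# Stand-in for the p.text values of a document (an assumption for this demo).
paragraphs = ["First paragraph", "", "Third paragraph"]

def joined_with_write(texts):
    # write() emits exactly what it is given, nothing more.
    buf = io.StringIO()
    for text in texts:
        buf.write(text)
    return buf.getvalue()

def joined_with_print(texts):
    # print() appends "\n" after every call.
    buf = io.StringIO()
    for text in texts:
        print(text, file=buf)
    return buf.getvalue()

print(repr(joined_with_write(paragraphs)))
print(repr(joined_with_print(paragraphs)))
```

The first result is one run-on line, while the second has one line per paragraph, including an empty line where the image would be.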

To preserve the paragraph structure, we should add a line ending to each paragraph. On Windows it seems that should be "\r\n", but then the resulting file displays ^M at the end of each line. It turns out that Python has the concept of universal newlines: in text mode, "\n" stands for a newline on every platform and is converted to the platform-specific convention when it is written. So if you write "\r\n" on Windows, you end up with "\r\r\n" in the file. This is a nice feature, because a script that writes "\n" works on every platform without worrying about the line ending format. But every programming language has its own approach to portable line endings: many expose the line ending as a system constant, while some, such as Python, change the meaning of an existing character, which makes things more complicated.
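The doubled \r can be reproduced on any platform. This sketch wraps an in-memory byte buffer in io.TextIOWrapper and passes newline="\r\n", which mimics what a text-mode file does by default on Windows. The helper name written_bytes is invented for the example.

```python
import io

def written_bytes(text, newline):
    """Hypothetical helper: return the bytes a text-mode file would store."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # leave the underlying buffer open
    return data

# newline="\r\n" behaves like the Windows default: every "\n" is expanded.
print(written_bytes("paragraph\r\n", "\r\n"))  # b'paragraph\r\r\n'
print(written_bytes("paragraph\n", "\r\n"))    # b'paragraph\r\n'

# newline="" turns the translation off, so the text is stored untouched.
print(written_bytes("paragraph\r\n", ""))      # b'paragraph\r\n'
```

So there are two ways out: write "\n" and let Python translate it, or pass newline="" to open and write the exact line ending you want.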

The final step is to extract the image files and store them in the destination folder.
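Because a docx file is a zip archive, finding the images comes down to filtering the archive's member names. The sketch below builds a tiny fake archive in memory so it runs without Word or a real document. The helper names and the fake contents are assumptions for illustration; the word/media/ layout is where Word normally stores embedded images.

```python
import io
import zipfile

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")

def list_images(archive):
    """Return the names of the image members in a docx (zip) archive."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist()
                if name.lower().endswith(IMAGE_EXTENSIONS)]

def make_fake_docx():
    # Not a valid Word document, only a zip with a docx-like layout.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/media/image1.png", b"not a real image")
        zf.writestr("word/media/image2.jpeg", b"not a real image")
    buf.seek(0)
    return buf

print(list_images(make_fake_docx()))
```

With a real file you would pass its path to list_images instead of the in-memory buffer, then extract each listed member as extractimgs does.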