Extract text and images from a doc or docx file with Python

One of the best things about Python is how well it handles platform-specific tasks. On Windows in particular, whenever a task needs access to platform facilities, Python is my first choice, and it has yet to disappoint me. It can call the Win32 API and invoke COM objects, which is indispensable when you want to automate a workflow to gain productivity.

Suppose you are given a doc or docx file and need to post its content to a CMS. You can't just copy and paste, especially the images. Instead, you have to click on every image, save it as a file, upload it to the server, and insert it into the content. This is a cumbersome and repetitive process. It would be much better if we could extract the text and images and store them separately.

It turns out this can be done in Python with a few lines of code, as shown below.

 
import win32com.client  # part of pywin32; requires Word to be installed
import docx             # provided by the python-docx package
import zipfile
import os
import shutil
 
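def doc2docx(path):
    # Word resolves relative paths against its own working directory,
    # so hand it an absolute path.
    path = os.path.abspath(path)
    word = win32com.client.Dispatch('word.application')
    word.DisplayAlerts = 0
    word.visible = 0
    try:
        doc = word.Documents.Open(path)
        doc.SaveAs(path + "x", 12)  # 12 = wdFormatXMLDocument (docx)
        doc.Close()
    finally:
        word.Quit()  # always shut down the Word process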
 
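def extracttext(docxpath):
    doc = docx.Document(docxpath)
    # Append mode: text from every processed document lands in the same file.
    with open("f:/tempbuffer.txt", "a", encoding="utf-8") as fp:
        for p in doc.paragraphs:
            # p.text carries no line ending, so add one per paragraph
            fp.write(p.text + "\n")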
 
 
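def extractimgs(docxpath, dstpath):
    with zipfile.ZipFile(docxpath) as doc:
        for info in doc.infolist():
            if info.filename.lower().endswith((".png", ".jpg", ".jpeg", ".gif")):
                # extract() recreates the archive's folder structure under
                # dstpath and returns the path of the extracted file
                extracted = doc.extract(info, dstpath)
                # copy it to the top of dstpath, prefixed with the document name
                name = os.path.basename(docxpath) + os.path.basename(info.filename)
                shutil.copy(extracted, os.path.join(dstpath, name))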
 
docpath = "my.doc"
docxpath = docpath + "x"
doc2docx(docpath)
extracttext(docxpath)
extractimgs(docxpath, "f:\\imgs")
 

The docx format is just a zip file, which is easier to deal with, so if we are given a doc file the first step is to convert it to docx. You can open the file in Word and click Save As to do it manually, but it is more convenient to do it programmatically. With the win32com module, you can invoke the same "Save As" feature from Python through a COM object. You create a COM object by calling win32com.client.Dispatch with a ProgID or CLSID as the parameter; for Word, the ProgID is word.application. With that object we can open a doc file and save it in docx format.
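To see the "just a zip file" claim for yourself, here is a minimal sketch that lists the images inside an archive. The helper name list_media is made up for illustration, and the sketch builds a tiny stand-in archive in memory instead of opening a real document; it assumes the usual docx layout, where embedded images live under word/media/.

```python
import io
import zipfile

def list_media(docx_file):
    """Return the archive members stored under word/media/ (hypothetical helper)."""
    with zipfile.ZipFile(docx_file) as archive:
        return [name for name in archive.namelist()
                if name.startswith("word/media/")]

# Build a stand-in archive in memory so the sketch runs without a real document.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/media/image1.png", b"not a real image")

print(list_media(buffer))  # ['word/media/image1.png']
```

With a real file you would pass its path, for example list_media("my.docx").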

How to handle line endings in Python

Programmers deal with text every day, no matter which programming language they use. Two issues have caused many of them grief: one is encoding, the other is line endings. Here we are using Python as the language and docx as the format, and both have quirks around line endings that we need to deal with.

First, a recap of the fundamentals. \r is the carriage return (CR), ASCII code 0x0d; \n is the newline, or line feed (LF), ASCII code 0x0a. On a typewriter, a carriage return moves horizontally back to the beginning of the current line, while a line feed moves vertically to the next line. Windows uses \r\n (0x0d 0x0a) as its line ending, Unix (including modern macOS) uses \n, and classic Mac OS used \r.
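A quick way to check these values from Python, as a small sketch: the sample string is made up and mixes all three conventions, and str.splitlines is used because it recognizes each of them.

```python
# Code points of the two control characters.
print(hex(ord("\r")), hex(ord("\n")))  # 0xd 0xa

# One line ending per convention: Windows, Unix, classic Mac OS.
sample = "windows\r\nunix\nclassic mac\rlast"
lines = sample.splitlines()
print(lines)  # ['windows', 'unix', 'classic mac', 'last']
```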

When you read lines from a text file, each line comes with its own line ending. When you read from a format like docx, you get paragraphs and their text instead. Surprisingly, paragraphs and text have nothing to do with line endings: CR and LF are not considered part of the text. If you loop over all the paragraphs and print them to the console, the output resembles the structure of the document, and each image occupies an empty line because every image is an empty paragraph. But this is an illusion, because each call to print automatically ends with a new line. If the paragraphs are written to a file as they are, they all end up crammed into a single line.
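The effect is easy to reproduce without a real document. In this sketch the paragraphs list is a made-up stand-in for the text of each paragraph in a document, with an empty string playing the role of an image-only paragraph, and io.StringIO stands in for the output file.

```python
import io

# Stand-in for the paragraph texts; "" plays the image-only paragraph.
paragraphs = ["First paragraph", "", "Second paragraph"]

# Writing the text as-is: nothing separates the paragraphs.
crammed = io.StringIO()
for text in paragraphs:
    crammed.write(text)

# Writing one "\n" per paragraph keeps the structure.
separated = io.StringIO()
for text in paragraphs:
    separated.write(text + "\n")

print(repr(crammed.getvalue()))    # 'First paragraphSecond paragraph'
print(repr(separated.getvalue()))  # 'First paragraph\n\nSecond paragraph\n'
```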

To preserve the paragraph structure we should add a line ending to each paragraph. On Windows that should be "\r\n", but then the resulting file displays ^M at the end of each line. It turns out that Python has the concept of universal newlines: "\n" serves as a universal escape for a new line and is converted to the platform-specific convention when written to a text file, so if you write \r\n on Windows, you end up with \r\r\n. This is a nice feature, because our script works on every platform without our having to worry about the line ending format. But every programming language has its own solution for portable line endings. Most implement it as a system constant, while some, such as Python, change the meaning of an existing character, which makes things more complicated.
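The doubled \r can be reproduced on any platform. The sketch below passes newline="\r\n" to open() to force the translation that Windows applies by default, and newline="" to switch translation off; the helper written_bytes is a made-up name for illustration.

```python
import os
import tempfile

def written_bytes(text, newline):
    """Write text in text mode with the given newline setting and
    return the raw bytes that ended up on disk (hypothetical helper)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline=newline) as fp:
            fp.write(text)
        with open(path, "rb") as fp:
            return fp.read()
    finally:
        os.remove(path)

# Every "\n" becomes "\r\n", so an explicit "\r\n" turns into "\r\r\n".
print(written_bytes("line\r\n", "\r\n"))  # b'line\r\r\n'
# Writing a plain "\n" gives the intended Windows line ending.
print(written_bytes("line\n", "\r\n"))    # b'line\r\n'
# newline="" disables translation: what you write is what you get.
print(written_bytes("line\r\n", ""))      # b'line\r\n'
```

So in a text-mode file, either write "\n" and let Python translate, or open the file with newline="" and write the exact bytes you want.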

Sending newlines to a socket

It looks like we don't need to worry about line endings in Python any more. This "universal newline" thing is supposed to make our lives easier, but in reality it makes things more complicated. Consider a socket client written in Java that sends a header ending with two \r\n sequences, followed by the content, to a server. It works well in Java.

The code looks like this:

 
        byte[] body = command.getBytes("UTF-8");
        String headerstring = "Content-Length: " + body.length + "\r\n\r\n";
        outputStream.write(headerstring.getBytes("UTF-8"));
        outputStream.write(body);
        outputStream.flush();
 

When I tried to send the same content from Python, the server didn't respond at all and the code hung on the recv call. Capturing the traffic of both clients with Wireshark showed the key difference: the packet sent by the Java client ends with 0x0d 0x0a 0x0d 0x0a, but the Python client sent 0x0a 0x0a. Changing "\n\n" in the Python code to "\r\n\r\n" solved the problem. So if there is a "universal newline", why wasn't the newline converted to the so-called "platform-dependent newline" in this case?
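For reference, a Python counterpart of the Java snippet might look like the sketch below. It is not the original client: frame_message is a made-up helper, and socket.socketpair stands in for the real server connection so the bytes on the wire can be inspected locally.

```python
import socket

def frame_message(command):
    """Build the header plus body as bytes, with explicit CRLF pairs
    (hypothetical helper mirroring the Java code)."""
    body = command.encode("utf-8")
    header = "Content-Length: {}\r\n\r\n".format(len(body)).encode("utf-8")
    return header + body

# A connected pair of sockets stands in for the client and server ends.
left, right = socket.socketpair()
try:
    left.sendall(frame_message("ping"))
    received = right.recv(1024)
finally:
    left.close()
    right.close()

# Sockets carry bytes, so exactly what was encoded arrives on the other side.
print(received)  # b'Content-Length: 4\r\n\r\nping'
```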

It turns out there was a big misunderstanding. Universal newline support was never meant to be truly "universal"; it applies only in a very limited set of situations. It was designed so that files created on one platform can be read (and imported) on another, so it is largely about reading files.

Here is how PEP 278, which introduced the feature, puts it:
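The reading side can be sketched like this. The byte string raw is made-up sample data mixing the three conventions, and io.TextIOWrapper over io.BytesIO stands in for a file opened in text mode, since it takes the same newline parameter as open().

```python
import io

# Sample data with a Windows, a Unix and a classic Mac OS line ending.
raw = b"windows\r\nunix\nclassic mac\r"

# newline=None (the default): every ending is recognized and handed
# back to the program as "\n".
translated = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", newline=None).readlines()

# newline="": endings are still recognized for splitting lines,
# but returned exactly as they appear in the data.
untouched = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", newline="").readlines()

print(translated)  # ['windows\n', 'unix\n', 'classic mac\n']
print(untouched)   # ['windows\r\n', 'unix\n', 'classic mac\r']
```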

There is no output implementation of universal newlines, Python programs are expected to handle this by themselves or write files with platform-local convention otherwise. The reason for this is that input is the difficult case, outputting different newlines to a file is already easy enough in Python.

When it comes to writing files, newline translation may or may not be applied. The rule of thumb is that it happens only when you write text through a file object opened in text mode, where "\n" is translated to the platform's line ending by default. Anywhere else, for example when you encode a string to bytes yourself or send data over a socket, nothing is translated. The following example involves no translation:

 
  >>> def print_hex(bytes):
  ...     l = [hex(int(i)) for i in bytes]
  ...     print(" ".join(l))
  ...
  >>> print_hex("ss\n".encode())
  0x73 0x73 0xa
  >>> print_hex("ss\r\n".encode())
  0x73 0x73 0xd 0xa
 

Universal newline support should not be something you rely on. Follow the widely accepted conventions that are the same in other languages, and spell out the line ending that a format or protocol requires. If things go wrong you still have to check case by case, but that is much better than working from false assumptions.

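With line endings out of the way, the final step is to extract the image files from the docx archive and store them in the destination folder, which is what extractimgs above does.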

Convert DOCX to HTML

An alternative way to save the text and images separately is to simply save the doc or docx file as HTML. What you get is an HTML file and a folder containing all the images referenced by that file.

 
def saveAsHtml(path):
    word = win32com.client.Dispatch('word.application')
    word.DisplayAlerts = 0
    word.visible = 0
    try:
        doc = word.Documents.Open(path)
        doc.SaveAs(path + ".html", 8)  # 8 = wdFormatHTML
        doc.Close()
    finally:
        word.Quit()  # always shut down the Word process
 
def extractalldocs():
    docdirectory = "C:\\docs"
    for doc in os.listdir(docdirectory):
        docfullpath = os.path.join(docdirectory, doc)
        # match the real extensions only, regardless of case
        if docfullpath.lower().endswith(('.doc', '.docx')):
            saveAsHtml(docfullpath)
 

For display on the web, JPG is often the more desirable format. The following code converts all the extracted PNG images to JPG.

 
from PIL import Image
import glob
 
def isvalidimg(imgpath):
    # verify() checks the file for corruption without decoding the image data
    try:
        with Image.open(imgpath) as im:
            im.verify()
    except Exception:
        return False
    return True
 
def transimg(imgpath):
    if not isvalidimg(imgpath):
        return False
    try:
        # reopen the file: an image object can't be used after verify()
        with Image.open(imgpath) as im:
            # JPEG has no alpha channel, so convert to RGB before saving
            im.convert('RGB').save(imgpath.rsplit(".", 1)[0] + ".jpg")
        return True
    except Exception:
        return False
 
def afterextraction():
    for name in glob.glob('C:/docs/*/*.png'):
        transimg(name)