Extract text and images from a doc or docx file with Python
One of the best things about Python is how well it handles platform-specific tasks. On Windows in particular, whenever a task needs access to platform facilities, Python is my first choice, and it has yet to disappoint me. It can call the Win32 API and invoke COM objects, which is indispensable when you want to automate a workflow and gain productivity.
Suppose you are given a doc or docx file and need to post its content to a CMS. You can't just copy and paste, especially not the images: each one has to be saved to a file, uploaded to the server, and inserted into the content. Clicking on every image and saving it by hand is cumbersome and repetitive. It would be better to extract the text and images and store them separately.
It turns out this can be done in Python with a few lines of code, as shown below.
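    import os
    import shutil
    import zipfile

    import docx  # python-docx
    import win32com.client  # pywin32, Windows only


    def doc2docx(path):
        """Convert a .doc file to .docx through Word's own Save As feature."""
        # Word resolves relative paths against its own working directory,
        # so always hand it an absolute path.
        path = os.path.abspath(path)
        word = win32com.client.Dispatch('word.application')
        word.DisplayAlerts = 0
        word.visible = 0
        doc = word.Documents.Open(path)
        doc.SaveAs(path + "x", 12)  # 12 = wdFormatXMLDocument (docx)
        doc.Close()
        word.Quit()


    def extracttext(docxpath):
        """Append the text of every paragraph to the buffer file, one per line."""
        doc = docx.Document(docxpath)
        with open("f:/tempbuffer.txt", "a", encoding="utf-8") as fp:
            for p in doc.paragraphs:
                fp.write(p.text + "\n")


    def extractimgs(docxpath, dstpath):
        """Extract embedded images and copy them to dstpath, prefixed with the docx name."""
        with zipfile.ZipFile(docxpath) as doc:
            for info in doc.infolist():
                if info.filename.endswith((".png", ".jpg", ".jpeg", ".gif")):
                    # extract() returns the path of the file it created.
                    extracted = doc.extract(info.filename, dstpath)
                    shutil.copy(extracted, os.path.join(
                        dstpath,
                        os.path.basename(docxpath) + os.path.basename(info.filename)))


    docpath = os.path.abspath("my.doc")
    docxpath = docpath + "x"
    doc2docx(docpath)
    extracttext(docxpath)
    extractimgs(docxpath, "f:\\imgs")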
A docx file is just a zip archive, which is much easier to work with, so if we are given a doc file the first step is to convert it to docx. You can open the file in Word and click Save As to do it manually, but doing it programmatically is more convenient. With the win32com module, Python can invoke the same "Save As" feature through a COM object. You create a COM object by calling win32com.client.Dispatch with a ProgID or CLSID as the parameter; for Word, the ProgID is word.application. With that object we can open a doc file and save it in docx format (the 12 passed to SaveAs is Word's wdFormatXMLDocument constant).
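To see what "just a zip archive" means in practice, here is a minimal sketch that lists the image entries of an archive. It assumes the usual docx layout, where embedded images live under word/media/. The sample built here is only a stand-in with the same member names, not a document Word could open, and build_sample_docx and list_media are hypothetical helper names.

```python
import io
import zipfile


def build_sample_docx():
    """Build a tiny in-memory archive that mimics a docx layout (stand-in only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/media/image1.png", b"fake png bytes")
    buf.seek(0)
    return buf


def list_media(docx_file):
    """Return the names of archive members stored under word/media/."""
    with zipfile.ZipFile(docx_file) as zf:
        return [name for name in zf.namelist() if name.startswith("word/media/")]


sample = build_sample_docx()
print(zipfile.is_zipfile(sample))  # the zipfile module recognises it as an archive
print(list_media(sample))
```

With a real file you would pass the path of the converted docx to list_media instead of the in-memory sample.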
How to handle line endings in Python
Programmers deal with text every day, whatever language they use. Two issues cause constant trouble: encoding and line endings. Here we are using Python as the language and docx as the format, and both have quirks regarding line endings that we need to deal with.
First, a recap of the fundamentals. \r is the carriage return (CR), ASCII code 0x0d; \n is the line feed (LF), ASCII code 0x0a. On a typewriter, a carriage return moves the carriage horizontally back to the beginning of the current line, while a line feed moves vertically to the next line. Windows uses \r\n (0x0d 0x0a) as its line ending, Unix (including modern macOS) uses \n, and classic Mac OS used \r.
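These fundamentals can be checked from the interpreter. The sketch below prints the two code points and then splits a made-up string that mixes all three conventions; str.splitlines() treats \n, \r\n and \r alike.

```python
# Code points of the two control characters.
print(hex(ord("\r")))  # 0xd
print(hex(ord("\n")))  # 0xa

# A made-up sample mixing the Unix, Windows and classic Mac OS conventions.
sample = "unix\nwindows\r\nclassic mac\rend"

# splitlines() recognises every convention and drops the line endings.
lines = sample.splitlines()
print(lines)  # ['unix', 'windows', 'classic mac', 'end']
```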
When you read lines from a text file, each line ends with its own line ending. When you read from a format like docx, you get paragraphs and their text instead, and perhaps surprisingly, paragraphs have nothing to do with line endings: the CR and LF that would separate them are not part of the text. If you loop over all paragraphs and print them to the console, the output resembles the structure of the document, with each image occupying an empty line because every image sits in a paragraph with no text. But this is an illusion, because each call to print automatically ends with a new line. If the paragraphs are written to a file as they are, they all end up crammed into a single line.
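The effect is easy to reproduce without a real document. In this sketch a plain list of strings stands in for the text of each paragraph (an assumption made so the example runs anywhere), and the empty string plays the role of an image-only paragraph.

```python
import io

# Stand-in for the text of each paragraph in a document.
paragraphs = ["First paragraph.", "", "Second paragraph."]

# Writing the text as-is: nothing separates one paragraph from the next.
crammed = io.StringIO()
for text in paragraphs:
    crammed.write(text)

# Appending "\n" to each paragraph keeps one paragraph per line.
structured = io.StringIO()
for text in paragraphs:
    structured.write(text + "\n")

print(repr(crammed.getvalue()))
print(repr(structured.getvalue()))
```

The first result is a single run of text, while the second splits back into the original three paragraphs, including the empty one.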
To preserve the paragraph structure, we should add a line ending to each paragraph. On Windows that would seem to be "\r\n", but then the resulting file displays ^M at the end of each line. It turns out that Python has the concept of universal newlines: in text mode, "\n" stands for a new line and is converted to the platform-specific convention when written, so if you write \r\n on Windows you end up with \r\r\n. This is a nice feature, because the script works on every platform without worrying about the line ending format. However, every programming language has its own approach to portable line endings: most implement it as a system constant, while some, such as Python, change the meaning of an existing character on output, which makes things more complicated.
The final step is to extract the image files and store them in the destination folder.
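The doubled \r can be reproduced on any platform. The sketch below relies on the newline parameter of the built-in open(): passing newline="\r\n" makes "\n" translate to "\r\n" on write, which is what the default does on Windows, while newline="" turns translation off. The helper name written_bytes is hypothetical.

```python
import os
import tempfile


def written_bytes(text, newline):
    """Write text in text mode with the given newline setting; return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline=newline) as fp:
            fp.write(text)
        with open(path, "rb") as fp:
            return fp.read()
    finally:
        os.remove(path)


# newline="\r\n" mimics the Windows default on any platform.
print(written_bytes("line\r\n", "\r\n"))  # b'line\r\r\n' : the stray \r shows up as ^M
print(written_bytes("line\n", "\r\n"))    # b'line\r\n'   : write "\n" and let Python translate
print(written_bytes("line\r\n", ""))      # b'line\r\n'   : translation disabled
```

So there are two ways out: write "\n" and let the translation happen, or open the file with newline="" and write the exact bytes you want.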
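As a variation on this step, the images can be streamed straight out of the archive into a flat folder, which avoids leaving an intermediate word/media directory behind. This is only a sketch under a few assumptions: images are taken from word/media/, the extension check is case-insensitive, and extract_images_flat and its prefix parameter are hypothetical names. The archive used for the demo is a stand-in built in memory.

```python
import io
import os
import shutil
import tempfile
import zipfile

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def extract_images_flat(docx_file, dst_dir, prefix=""):
    """Copy every image under word/media/ into dst_dir; return the written paths."""
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_file) as archive:
        for info in archive.infolist():
            name = info.filename
            if not name.startswith("word/media/"):
                continue
            if not name.lower().endswith(IMAGE_EXTENSIONS):
                continue
            target = os.path.join(dst_dir, prefix + os.path.basename(name))
            # Stream the member to disk without recreating its folder structure.
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written


# Stand-in archive with docx-like member names (not a real Word file).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/media/image1.png", b"fake png bytes")
    zf.writestr("word/media/image2.JPG", b"fake jpg bytes")

with tempfile.TemporaryDirectory() as tmp:
    paths = extract_images_flat(buf, tmp, prefix="my_")
    extracted_names = sorted(os.path.basename(p) for p in paths)
    print(extracted_names)
```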
Convert DOCX to HTML
An alternative way to save text and images separately is simply to save your doc or docx file as HTML. What you get is an HTML file and a folder containing all the images referenced by it.
    import os

    import win32com.client  # pywin32, Windows only


    def saveAsHtml(path):
        """Save a Word document as HTML next to the original file."""
        word = win32com.client.Dispatch('word.application')
        word.DisplayAlerts = 0
        word.visible = 0
        doc = word.Documents.Open(path)
        doc.SaveAs(path + ".html", 8)  # 8 = wdFormatHTML
        doc.Close()
        word.Quit()


    def extractalldocs():
        docdirectory = "C:\\docs"
        for doc in os.listdir(docdirectory):
            docfullpath = os.path.join(docdirectory, doc)
            # Match the real extension; a bare 'doc' suffix would also match "mydoc".
            if docfullpath.lower().endswith(('.doc', '.docx')):
                saveAsHtml(docfullpath)
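Once the HTML file exists, the images it references still have to be found before they can be uploaded. One possible way, sketched here with the standard library's html.parser, is to collect the src attribute of every img tag. The sample markup and file names are made up, ImageSourceCollector and image_sources are hypothetical names, and the HTML that Word produces may reference images in other ways that this sketch does not cover.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html_text):
    """Return the image paths referenced by the given HTML text, in document order."""
    collector = ImageSourceCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# Made-up markup standing in for a saved HTML file.
sample_html = (
    '<html><body><p>Intro</p>'
    '<p><img src="my_files/image001.png" alt=""></p>'
    '<p><IMG SRC="my_files/image002.jpg"></p>'
    '</body></html>'
)
print(image_sources(sample_html))
```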
For display on the web, the JPG format is often preferable. The following code converts all extracted PNG images to JPG format.
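    import glob

    from PIL import Image  # Pillow


    def isvalidimg(imgpath):
        """Return True if Pillow can open and verify the file."""
        try:
            with Image.open(imgpath) as im:
                im.verify()
        except Exception:
            return False
        return True


    def transimg(imgpath):
        """Save a JPG copy of the image next to the original; return True on success."""
        if not isvalidimg(imgpath):
            return False
        try:
            # verify() leaves the image object unusable, so open the file again.
            im = Image.open(imgpath)
            imgnew = im.convert('RGB')  # JPG has no alpha channel
            # rsplit returns a list; take the part before the extension.
            imgnew.save(imgpath.rsplit(".", 1)[0] + ".jpg")
            return True
        except Exception:
            return False


    def afterextraction():
        for name in glob.glob('C:/docs/*/*.png'):
            transimg(name)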
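Deriving the output file name deserves a little care. A minimal sketch of one alternative, using os.path.splitext from the standard library: it only treats a dot in the last path component as the start of an extension, whereas splitting on the last dot of the whole string can cut a path inside a folder name. The helper name jpg_path_for and the sample paths are made up for illustration.

```python
import os


def jpg_path_for(imgpath):
    """Return imgpath with its extension (if any) replaced by .jpg."""
    root, _ext = os.path.splitext(imgpath)
    return root + ".jpg"


print(jpg_path_for("C:/docs/my_files/image001.png"))  # C:/docs/my_files/image001.jpg

# A file without an extension inside a folder whose name contains a dot.
print(jpg_path_for("C:/docs/v1.2/image"))               # C:/docs/v1.2/image.jpg
print("C:/docs/v1.2/image".rsplit(".", 1)[0] + ".jpg")  # C:/docs/v1.jpg
```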