Use SAX to extract data from XML with Python

There are two common approaches to processing an XML document. The first is DOM, which builds the whole element tree in memory and lets you reach elements through selectors or API calls (getElementsByTagName, getChild, etc.). The programmer must state explicitly which nodes to visit.
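For comparison, here is a minimal sketch of the DOM style using xml.dom.minidom from the standard library. The sample document and the variable names are made up for illustration; the point is that the tree is built first and the nodes are then requested by name.

```python
from xml.dom import minidom

# Hypothetical sample document, small enough to hold in memory.
XML = """<contents>
  <part title="Emacs">
    <chapter title="Search in Emacs"/>
    <chapter title="Editing in Emacs"/>
  </part>
</contents>"""

# parseString builds the complete element tree before returning.
doc = minidom.parseString(XML)

# The caller names the nodes it wants explicitly.
titles = [c.getAttribute("title") for c in doc.getElementsByTagName("chapter")]
print(titles)
```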

The second approach is SAX, which stands for Simple API for XML. It does not keep a tree in memory. While parsing, a walker visits the nodes in document order, which amounts to a depth-first traversal. The walker defines a set of methods that are called as each node is visited, and the programmer supplies the callbacks by extending the walker and overriding those methods. With this approach the walker always visits every node, and the programmer decides which ones to react to.

In Python's SAX module, you extend the class xml.sax.handler.ContentHandler. It has three interesting methods you may want to override:

  • startElement

Invoked when the walker enters an element

  • characters

Invoked when the walker meets character data between tags, including whitespace and line breaks. A single run of text may be delivered in several calls.

  • endElement

Invoked when the walker leaves an element
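To actually extract data, a handler usually reacts only to the elements it cares about and ignores the rest. The sketch below is one possible way to do that: the class name TitleCollector and the sample XML are hypothetical, and it assumes the wanted data lives in the title attribute of leaf elements.

```python
import xml.sax
import xml.sax.handler


class TitleCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler: collects the title attribute of every <leaf>."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values.
        if name == "leaf":
            self.titles.append(attrs.get("title"))


# Made-up sample input, passed as bytes so no file is needed.
XML = b"""<section title="Advanced Search">
  <leaf title="First leaf"></leaf>
  <leaf title="Second leaf"></leaf>
</section>"""

collector = TitleCollector()
xml.sax.parseString(XML, collector)
print(collector.titles)
```

The section element is still visited, but the handler simply does nothing for it.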

Here is an example that shows when each of the three methods is called.

The test XML file, testsax.xml, is shown below:

 
<?xml version="1.0"?>
<contents>
  <part title="Emacs">
    not in &amp; any element
    <chapter title="Search in Emacs">
      <section type="internal" title="Advanced Search">
        <leaf title="Emacs multi-occur Search All Occurrences and List Search Results" seotitle="emacs-multioccur-search-all-occurrences-and-list-search-results"></leaf>
        <leaf title="Emacs Evil mode regular expression search and replace"  seotitle="emacs-evil-mode-regular-expression-search-and-replace"></leaf>
      </section>
    </chapter>
  </part>  
</contents>
 
 

The Python code

 
import xml.sax          # provides make_parser()
import xml.sax.handler  # provides ContentHandler
 
class MyHandler(xml.sax.handler.ContentHandler):
 
    def startElement(self, name, attributes):
        print("visit start element, name: " + name)
 
    def characters(self, data):
        print("data: --" + data + "--")
 
    def endElement(self, name):
        print("visit end element, name: " + name)
 
 
parser = xml.sax.make_parser()
handler = MyHandler()
parser.setContentHandler(handler)
 
parser.parse("testsax.xml")
 
 

The output:

 
visit start element, name: contents
data: --
--
data: --  --
visit start element, name: part
data: --
--
data: --    not in --
data: --&--
data: -- any element--
data: --
--
data: --    --
visit start element, name: chapter
data: --
--
data: --      --
visit start element, name: section
data: --
--
data: --        --
visit start element, name: leaf
visit end element, name: leaf
data: --
--
data: --        --
visit start element, name: leaf
visit end element, name: leaf
data: --
--
data: --      --
visit end element, name: section
data: --
--
data: --    --
visit end element, name: chapter
data: --
--
data: --  --
visit end element, name: part
data: --  --
data: --
--
visit end element, name: contents
 

As you can see, characters is invoked every time the parser encounters a line break, a run of character data, or an entity such as the ampersand. A single piece of text can therefore arrive split across several calls.
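Because the text can arrive in pieces, a common pattern is to buffer the chunks in characters and join them in endElement. The following is a minimal sketch of that idea, not the only way to do it. The class name TextCollector and the sample XML are hypothetical, and the handler assumes the elements of interest contain only text, with no child elements.

```python
import xml.sax
import xml.sax.handler


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler: records the text of text-only elements."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.texts = {}

    def startElement(self, name, attrs):
        # Start a fresh buffer for every element that opens.
        self._chunks = []

    def characters(self, data):
        # May be called several times for one run of text.
        self._chunks.append(data)

    def endElement(self, name):
        # Join the pieces and drop whitespace-only content.
        text = "".join(self._chunks).strip()
        if text:
            self.texts[name] = text
        self._chunks = []


# Made-up sample input containing an entity.
XML = b"<note><to>Tom &amp; Jerry</to></note>"

handler = TextCollector()
xml.sax.parseString(XML, handler)
print(handler.texts)
```

However the parser splits the text around the entity, the joined result for the to element is the same string.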