XPath syntax by example with Python and lxml

XML and its dialects, such as HTML, have become the de facto standard format for many kinds of documents: configuration files for frameworks like Spring and Struts, IDE project files such as those of Visual Studio, web service responses, UI layout files for Android, and so on. Parsing and manipulating them has become a must-have component of any programming infrastructure. Because the format is so ubiquitous, programmers will almost inevitably run into it. XML is to data what an AST is to code: it captures the structure and semantics of data, which makes it a natural universal language for describing data and information. The ability to read, parse, navigate, and manipulate XML documents is therefore an essential skill for any programmer.

lxml is a Pythonic binding for the C libraries libxml2 and libxslt, and it is quite easy to use. For a simple query such as finding a tag, you can use findtext, but complex queries need a more powerful tool. This is where XPath comes to the rescue. XPath is a mini language that lets you specify declaratively which nodes to select in an XML document. In some ways it is similar to CSS selectors, which select DOM elements: XPath lets you navigate through the elements and attributes of an XML document in much the same way. A path identifies a set of nodes, which can include elements, attributes, text, and so on.
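To make the difference between findtext and XPath concrete, here is a minimal sketch. The config document and its server, port, and name items are hypothetical and exist only for this illustration.

```python
from lxml import etree

# A tiny hypothetical XML document, used only for this illustration.
doc = etree.fromstring(
    "<config>"
    "<server name='a'><port>8080</port></server>"
    "<server name='b'><port>9090</port></server>"
    "</config>"
)

# findtext returns the text of the first matching element only.
first_port = doc.findtext("server/port")
print(first_port)  # 8080

# xpath can express richer conditions, such as a numeric comparison,
# and it returns every match as a list.
busy_names = doc.xpath("server[port > 9000]/@name")
print(busy_names)  # ['b']
```

findtext is enough when you want one value at a known location. Once the query needs conditions or several results, xpath is usually the better fit.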

XPath is heavily used in XSLT and XSL, because a stylesheet works by locating elements in the input document and applying a transformation to them to generate the output document, just as CSS uses selectors to locate elements in an HTML document and apply styles to them.

In this post we will illustrate XPath syntax with an example HTML file and the lxml API.

Create example HTML file

Create an HTML file named xpath-test.html with the following content:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>

Read the file and parse it with the lxml.html module:

import lxml.html
html = lxml.html.parse("xpath-test.html")
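If you would rather experiment without a file on disk, the same markup can be parsed from a string. This is a sketch only: the SAMPLE_HTML name is ours, and the markup assumes the list is wrapped in a div and a ul, as the queries in this post imply.

```python
import lxml.html

# Assumed sample markup: the list wrapped in <div><ul>, with the last
# <li> left unclosed so the parser has something to repair.
SAMPLE_HTML = """
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
"""

# document_fromstring returns the root element of a complete document.
root = lxml.html.document_fromstring(SAMPLE_HTML)
print(root.tag)  # html

# getroottree() gives an ElementTree, the same kind of object that
# lxml.html.parse returns for a file.
tree = root.getroottree()
li_count = len(tree.xpath("//li"))
print(li_count)  # 5
```

Both the element and the tree have an xpath method, so the queries in the rest of this post work on either one.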

Child operator /

Use this operator to separate a parent from its child.

When the child operator is the first character of the path, the path is absolute, that is, it starts from the root node.

Find the div node using an absolute path:

nodes = html.xpath('/html/body/div')
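The sketch below shows what such a query returns and how absolute and relative paths differ when xpath is called on an element. The two-item markup is a shortened, hypothetical stand-in for the example file.

```python
import lxml.html

# Shortened, hypothetical stand-in for the example file.
root = lxml.html.document_fromstring(
    "<div><ul>"
    "<li><a href='link1.html'>first item</a></li>"
    "<li><a href='link2.html'>second item</a></li>"
    "</ul></div>"
)

# An absolute path starts at the document root and returns a list of elements.
divs = root.xpath("/html/body/div")
div_tags = [el.tag for el in divs]
print(div_tags)  # ['div']

# A path without a leading slash is relative to the element it is called on.
items = divs[0].xpath("ul/li")
print(len(items))  # 2

# An absolute path ignores the context element and starts at the root again,
# so it still finds the div even when called on an li.
from_li = items[0].xpath("/html/body/div")
print(len(from_li))  # 1
```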

Notice that the HTML parser fixes your document by adding the missing html and body elements, so the root node is html. You can print the HTML source that is actually being used:

from lxml import etree
print(etree.tostring(html, encoding="unicode"))

The output will be

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
         <li class="item-0"><a href="link1.html">first item</a></li>&#13;
         <li class="item-1"><a href="link2.html">second item</a></li>&#13;
         <li class="item-inactive"><a href="link3.html">third item</a></li>&#13;
         <li class="item-1"><a href="link4.html">fourth item</a></li>&#13;
         <li class="item-0"><a href="link5.html">fifth item</a>&#13;

Select the text of all links:

nodes = html.xpath('/html/body/div/ul/li/a/text()')
print(nodes)

['first item', 'second item', 'third item', 'fourth item', 'fifth item']

Descendant operator //

Selects descendants of the parent node at any depth. When it is the first thing in the path, the search starts from the root node.

Modify the HTML as follows:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
    <div>
        <ul>
        <li><a href="link6.html">sixth item</a></li>
        </ul>
    </div>
</div>

Get the text of all links:

lis = html.xpath('//li/a/text()')
print(lis)

['first item', 'second item', 'third item', 'fourth item', 'fifth item', 'sixth item']

With absolute paths this would take two queries; with the descendant operator, one is enough.

Get only the links in the second ul list:

lis = html.xpath('/html/body/div/div//li/a/text()')
print(lis)
['sixth item']
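One detail is easy to miss when xpath is called on an element rather than on the whole tree: a leading // still searches the entire document, while .// stays below the current element. The sketch below uses a shortened, hypothetical version of the markup with the second list nested in an inner div.

```python
import lxml.html

# Hypothetical markup with two lists, the second nested in an inner <div>.
root = lxml.html.document_fromstring("""
<div>
    <ul>
        <li><a href="link1.html">first item</a></li>
        <li><a href="link2.html">second item</a></li>
    </ul>
    <div>
        <ul>
            <li><a href="link6.html">sixth item</a></li>
        </ul>
    </div>
</div>
""")

inner_div = root.xpath("/html/body/div/div")[0]

# A leading // always searches from the document root,
# even when xpath is called on an element.
everywhere = inner_div.xpath("//a/text()")
print(everywhere)  # ['first item', 'second item', 'sixth item']

# .// restricts the search to descendants of the context element.
below = inner_div.xpath(".//a/text()")
print(below)  # ['sixth item']
```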

Attribute @ operator

Filters elements by an attribute, for example the class attribute.

Select the link text of all li elements whose class attribute is "item-0":

attributes = html.xpath('//li[@class="item-0"]/a/text()')
print(attributes)
['first item', 'fifth item']
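The @ operator can also appear at the end of a path to return attribute values, or inside a predicate without a value to test whether an attribute exists at all. A small sketch, using a shortened, hypothetical three-item list:

```python
import lxml.html

# Shortened, hypothetical list: two items with a class, one without.
root = lxml.html.document_fromstring("""
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li><a href="link6.html">sixth item</a></li>
</ul>
""")

# @href at the end of a path returns the attribute values themselves.
hrefs = root.xpath("//li/a/@href")
print(hrefs)  # ['link1.html', 'link2.html', 'link6.html']

# [@class] without a value keeps only elements that have the attribute.
with_class = root.xpath("//li[@class]/a/text()")
print(with_class)  # ['first item', 'second item']

# not(...) inverts the test.
without_class = root.xpath("//li[not(@class)]/a/text()")
print(without_class)  # ['sixth item']
```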

Logical operator

A logical query returns a boolean: true if any of the selected nodes satisfies the comparison, false otherwise.

Test whether the second list contains a link whose text equals "first item":

logics = html.xpath("//div/div//a/text()='first item'")
print(logics)
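To see both outcomes side by side, here is a self-contained sketch. It uses a shortened, hypothetical version of the markup, with one link in the outer list and one in the nested list.

```python
import lxml.html

# Hypothetical markup: one link in the outer list, one in the nested list.
root = lxml.html.document_fromstring("""
<div>
    <ul>
        <li><a href="link1.html">first item</a></li>
    </ul>
    <div>
        <ul>
            <li><a href="link6.html">sixth item</a></li>
        </ul>
    </div>
</div>
""")

# Only the nested list is searched, and it has no link with that text.
in_second_list = root.xpath("//div/div//a/text()='first item'")
print(in_second_list)  # False

# Across the whole document, one link does match.
anywhere = root.xpath("//a/text()='first item'")
print(anywhere)  # True

# The comparison is true as soon as any selected node matches.
sixth_in_second = root.xpath("//div/div//a/text()='sixth item'")
print(sixth_in_second)  # True
```

lxml returns a plain Python bool for such expressions, so the result can be used directly in an if statement.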