Python: Why lxml's etree.tostring method returns bytes

When using the lxml library in Python, I found the API a little strange. For example, when I use the etree.tostring method to print the content of parsed HTML, it returns bytes.

 
from lxml import etree
print(etree.tostring(html))
 

It prints something like this:

 
b'<!DOCTYPE html PUBLIC "-//W3C//DTD ...
 

To print it as a string, we need to decode the bytes with an encoding:

 
print(etree.tostring(html).decode('utf-8'))
 

But the method name suggests it should return a string. Why this design? According to https://bugs.python.org/issue10942, the method returns a string only when the parameter encoding="unicode" is specified; otherwise it returns bytes.

Here is the documentation of the standard library's xml.etree.ElementTree.tostring, whose API lxml's etree follows:

 
xml.etree.ElementTree.tostring(element, encoding='us-ascii', method='xml', *, xml_declaration=None, default_namespace=None, short_empty_elements=True)
 
"""Generates a string representation of an XML element, including all subelements.
element is an Element instance.
encoding is the output encoding (default is US-ASCII).
Use encoding="unicode" to generate a Unicode string (otherwise, a bytestring is generated).
method is either "xml", "html" or "text" (default is "xml")."""
 

Thus to get a string as indicated by the method name:

 
print(etree.tostring(html, encoding="unicode"))
 
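The difference between the two return types can be checked with the standard library's xml.etree.ElementTree, whose documentation is quoted above. This is only a minimal sketch: the sample document is made up for illustration, and it assumes lxml's etree.tostring treats the encoding argument the same way.

```python
import xml.etree.ElementTree as ET

# A tiny made-up document containing one non-ASCII character (e with acute accent).
root = ET.fromstring("<p>caf\u00e9</p>")

# Default call: the result is a bytes object.
as_bytes = ET.tostring(root)

# encoding="unicode": the result is a str, no decoding needed.
as_text = ET.tostring(root, encoding="unicode")

# A real encoding name still gives bytes; decode with the same encoding to get text.
as_utf8 = ET.tostring(root, encoding="utf-8")
decoded = as_utf8.decode("utf-8")

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
print(type(as_utf8).__name__, decoded)
```

Only the special value "unicode" switches the return type to str. Any real encoding name, such as "utf-8", still produces bytes.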

The reason behind this odd behavior is a long, complicated story. Here is the short version: it is for compatibility with the mess of Unicode handling in Python 2.x. Only those who have worked solely with Python 3.x since they first met Python, and who know little about Python 2.x, would be surprised. Technically, for a Python 2.x programmer, returning bytes is perfectly valid, because in Python 2.x a string (str) is indeed a byte sequence and is aptly called a byte string. It makes no sense in Python 3.x, because Python 3 fixed the mess and followed the convention of many other mainstream languages: a string is always Unicode text. The "string" in the method name therefore has a double meaning: depending on the parameters, it can be a Python 2.x-style byte string or a Python 3.x Unicode string.
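The split between text and bytes in Python 3 can be seen without any XML library. A small sketch, using a made-up sample string:

```python
# In Python 3, str holds Unicode text and bytes holds raw byte values.
text = "caf\u00e9"            # four characters; the last one is non-ASCII
data = text.encode("utf-8")   # encoding turns text into bytes

print(type(text).__name__, len(text))   # length counted in characters
print(type(data).__name__, len(data))   # length counted in bytes

# Decoding with the same encoding recovers the original text.
round_trip = data.decode("utf-8")
print(round_trip == text)
```

The two lengths differ because the non-ASCII character takes two bytes in UTF-8, which is why a byte sequence and a string cannot be treated as the same thing.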
