For many Python beginners, encoding, decoding, and Unicode are a constant source of frustration. This guide aims to clear things up.

Bytestring and unicode

If you know a little Java, you know that the JVM represents all strings internally as Unicode, and that raw data, such as a text file on disk or data arriving from the network, is a byte stream. Python is similar. The difference is that in Python 2 both the bytestring (str) and the unicode string (unicode) are string types, whereas in Java only Unicode text can be a String.

A bytestring, as the name indicates, is just a sequence of bytes. The encoding is the rule that says how those bytes map to characters. Common encodings include UTF-8, ASCII, GB2312, and GBK.

When you know the encoding of a bytestring, you can convert it to unicode. That unicode string can then be converted to a bytestring in any other encoding.
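As a minimal sketch of that conversion, the snippet below takes a GBK bytestring, decodes it to unicode, and re-encodes it as UTF-8. The variable names and the sample text (the two characters U+4F60 and U+597D) are chosen here for illustration only, and the code is written so that it should run on both Python 2 and Python 3.

```python
# A bytestring whose encoding we happen to know: two Chinese characters in GBK.
gbk_bytes = b'\xc4\xe3\xba\xc3'

# Step 1: bytes -> unicode, using the known source encoding.
text = gbk_bytes.decode('gbk')

# Step 2: unicode -> bytes, using the target encoding.
utf8_bytes = text.encode('utf-8')

# Same text, different bytes: GBK uses 2 bytes per character here, UTF-8 uses 3.
print(repr(gbk_bytes))
print(repr(utf8_bytes))
```

The unicode string in the middle is the pivot: there is no direct "GBK to UTF-8" step, only decode followed by encode.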

Encode and decode

encode: you can only encode a unicode string. The result is a bytestring that holds the same text in the encoding you chose. For example, given a unicode string, you can encode it to UTF-8:

unistr = u'Ancient Aliens \u2014 Episodes, Video & Schedule - H2 on'
utfstr = unistr.encode('utf-8')

decode: the reverse of encode. It converts a bytestring back to a unicode string.

unistr = utfstr.decode('utf-8')
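To make the round trip concrete, here is a small sketch that encodes a shortened version of the string above, inspects the resulting bytes, and decodes them again. The shortened string and the name `roundtrip` are illustrative choices, and the code is meant to run on both Python 2 and Python 3.

```python
# A unicode string containing one non-ASCII character (U+2014).
unistr = u'Ancient Aliens \u2014 Episodes'

# encode: unicode -> bytestring. U+2014 becomes the three bytes E2 80 94.
utfstr = unistr.encode('utf-8')
print(repr(utfstr))

# decode: bytestring -> unicode. This recovers the original text exactly.
roundtrip = utfstr.decode('utf-8')
print(roundtrip == unistr)

# The bytestring is longer than the unicode string because one character
# needed three bytes instead of one.
print(len(unistr), len(utfstr))
```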

Common errors

UnicodeEncodeError: 'ascii' codec can't encode character. Sometimes this error shows up in a place that looks very strange:

aaa = u'Ancient Aliens \u2014 Episodes, Video & Schedule - H2 on'
print aaa.decode('utf-8') 

Run the script and you will get the error above. It looks weird because we are decoding a string, yet the error is an encode error.

Here we have a unicode string, which should not be decoded at all; only a bytestring can be decoded. Python knows the string is unicode, but the script asks to decode it, so Python 2 first tries to turn the unicode string into a bytestring by encoding it implicitly. For that implicit step it uses 'ascii', the default bytestring encoding in Python 2. ASCII only covers code points 0 to 127, so it cannot encode \u2014, and the hidden encode step is what raises the UnicodeEncodeError.
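The hidden step can be reproduced explicitly. The sketch below performs the ASCII encode by hand, which is an assumption about what Python 2 does behind the scenes rather than a trace of its internals, and then shows the fix: encode with an encoding that can represent the character. It uses a shortened string and is written to run on Python 2 and Python 3 (on Python 3 the original `aaa.decode(...)` call fails differently, because str has no decode method there).

```python
aaa = u'Ancient Aliens \u2014 Episodes'

# What Python 2 effectively attempts before decoding a unicode string:
# an encode to ASCII, which cannot represent U+2014.
try:
    aaa.encode('ascii')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    codec = exc.encoding                      # which codec gave up
    bad_char = exc.object[exc.start:exc.end]  # the character it choked on
    print(exc)

# The fix: do not decode unicode. Encode it explicitly with a codec
# that covers the character, for example UTF-8, when bytes are needed.
fixed = aaa.encode('utf-8')
print(repr(fixed))
```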


Unicode in the real world

Python UnicodeEncodeError: 'ascii' codec can't encode character

Making Sense of Python Unicode