Python, especially version 2, is notorious for Unicode encoding and decoding problems. I was bitten by them in the Python 2 era, and it was painful. The Unicode API in Python 2 is broken, inconsistent, and confusing.

Now we are using Python 3. It is supposed to be better, but I still hit a brick wall. I have little interest in figuring out what is really going on in most cases; the only thing I want is a quick solution that makes it stop throwing encoding exceptions.

I have a string that looks like this:

 
how to create a windows service in python « Python recipes « ActiveState Code
 

It is actually a string that comes from the Web. When I print it to the console, I get this error:

 
UnicodeEncodeError: 'gbk' codec can't encode character '\xab' in position 42: illegal multibyte sequence

What happened? Every string in Python 3 is Unicode. When we print one to the console, it has to be converted to the console's own character encoding, which is 'gbk' in this case. The string contains the character «, which has no corresponding character in the console's character set. The print function must first encode the Unicode string as 'gbk', and that step fails, so Python simply throws an exception.
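The failure can be reproduced without a gbk console by doing the encoding step by hand. This is only a small sketch: the variable names `title` and `error` are mine, and it assumes that print runs into the same strict encode when the console encoding is gbk.

```python
# The title from above; the first « sits at index 42.
title = "how to create a windows service in python « Python recipes « ActiveState Code"

error = None
try:
    # errors defaults to "strict", so an unencodable character raises
    title.encode("gbk")
except UnicodeEncodeError as exc:
    error = exc
    # The exception records the codec and the offending position
    print(exc.encoding, exc.start, ascii(exc.object[exc.start]))
```

The exception object carries the codec name and the index of the character that could not be encoded, which is where the "position 42" in the message comes from.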

In most other languages this situation is not treated as an error. They just encode the text anyway, even if the output is gibberish. And you know what? That is the right call in most cases. A wrong encoding is the programmer's fault and it gets noticed instantly, so throwing an exception is unnecessary.

Maybe the reason lies in the Python philosophy that explicit is better than implicit: when the encoding is wrong, Python thinks it should report it explicitly.

If all you want is for the code to keep going and ignore the encoding problem, try this solution:

 
print("title: ", unicode_str.encode('gbk', 'backslashreplace').decode('gbk', 'backslashreplace'))
 

Now the output will look like this:

 
title:  how to create a windows service in python \xab Python recipes \xab ActiveState Code - Google Chrome
 

The first encode converts the Unicode string to gbk, replacing any character gbk cannot represent with a backslash escape. The decode then converts the bytes back to Unicode. The new string no longer contains any character the console cannot encode, so it can be safely sent to the console.
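The same round trip can be wrapped in a small helper so the target encoding is not hard-coded everywhere. This is a sketch, not a definitive implementation: `console_safe` is a name I made up, and falling back to `sys.stdout.encoding` and then UTF-8 is my assumption about what a reasonable default is.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters that `encoding` cannot represent
    replaced by backslash escapes such as \\xab."""
    # Assumption: use the current stdout encoding if none is given,
    # and fall back to UTF-8 when stdout does not report one.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # Encode with escapes, then decode back to an ordinary str
    return text.encode(encoding, "backslashreplace").decode(encoding)


title = "how to create a windows service in python « Python recipes « ActiveState Code"
print("title: ", console_safe(title, "gbk"))
```

Characters that gbk can represent pass through unchanged; only the unencodable ones are escaped.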

Actually there is another way to achieve the same effect. By default, print encodes its output through sys.stdout, and that stream has an error handler that decides how encoding errors are dealt with. The default is 'strict', and this default can be changed. But the change affects everything written to that stream, not just one call. You probably don't want that if you just want a single print to work.
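Here is a rough sketch of that stream-level approach, assuming Python 3.7 or later, where text streams have a `reconfigure()` method. To keep the example independent of the real console, it uses an in-memory gbk stream (`raw` and `stream` are names I chose for illustration). On a real console the equivalent call would be `sys.stdout.reconfigure(errors="backslashreplace")`.

```python
import io

# An in-memory stand-in for a gbk console
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="gbk")  # errors defaults to "strict"

# Change the error handler for every later write to this stream
stream.reconfigure(errors="backslashreplace")

print("Python recipes « ActiveState Code", file=stream)
stream.flush()

# The « was written as the escape \xab instead of raising
print(raw.getvalue())
```

Setting the `PYTHONIOENCODING` environment variable (for example to `gbk:backslashreplace`) before starting Python should have a similar effect on the standard streams without touching the code.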