Jan 26, 2012

Python / Django UnicodeEncodeError hacks.

Python plays with unicode nicely nowadays. Bt what if you must deal with old time formats conversion, or ASCII files exporting for e.g. You may also use software that is out of date but is too long to rewrite...  Here often errors occur. I have received mine at copy-pasting from MS Word into Django admin UI  by stupid users. Most of the website played nicely with this fancy characters, but exporting to CSV failed due to non ASCII characters support. Google said nothing special. Python docs about unicode usage briefly cover this type of events. So here is the result of some hours of experiments. I've decided to rewrite some of the python functionality to create decode function with behavior for my needs. Hopefully they will shorten you some time with those collisions you may get in your Django apps...

Anyway I've started to receive errors like:
Exception Type:  UnicodeEncodeError
Exception Value: 'ascii' codec can't encode character u'\u2013' in position 17: ordinal not in range(128)
So I had a list of u'' values that contained special characters " ordinal not in range(128) ". Requires no imports... Pure python:
        values == [
          u'Some fancy text \u2013 something', 
          u'some normal, easy convertible text', 
          u'some more normal text'
          ]      
        HACK: entry cleanup for special characters (Fixing Bug #...)
        # entry cleanup for special characters
        i = 0
        for value in values:
            try:
                # if string can be encoded to 'ascii' pass
                unicode(value).encode('ascii')
            except UnicideEncodeError:
                val_temp = unicode(value)
                # cleaning up string with escaping non convertible characters
                result = []
                for symbol in val_temp:
                    try:
                        symbol.encode('ascii')
                        result.append(symbol)
                    except UnicodeEncodeError:
                        pass
                # rewriting wrong value in values array
                val_temp = ''.join(result)
                values[i] = val_temp
                pass
            i = i+1
        # normally work with our list... it's safe now...
        values == [
          u'Some fancy text  something', 
          u'some normal, easy convertible text', 
          u'some more normal text'
          ]
This code is a bit complicated due to mine specific task and has iterations in iterations etc... But it's from a working app and checked working. However here is the theoretical example that must clean up a single string:
value = u'Some fancy text \u2013 something'
try:
    # if string can be encoded to 'ascii' pass
    value.encode('ascii')
except:
    # cleaning up string with escaping non convertible characters
    result = []
    for symbol in val_temp:
        try:
            symbol.encode('ascii')
            result.append(symbol)
        except UnicodeEncodeError:
            pass
    # rewriting our variable with safe one
    value = ''.join(result)
    pass
# normally work with our unicode string... it's safe now...
value = u'Some fancy text  something
So the technique here is simple. We are checking if this unicode string can be converted to 'ascii' python encoding without errors we simply passing through. And if it's not... Converting it to 'ascii' string symbol by symbol. Symbols that will fail will be gracefully omitted. You can create a function from all of this, like 'my_decode_cleanup' or something and use whenever needed...

Hope this will help you to save some precious time during your python development.

Helped? I'm wrong somewhere? Please comment!

3 comments:

  1. Hi,

    .encode() method could handle this for you:

    >>> a
    u'Some fancy text \u2013 something'

    >>> a.encode()
    Traceback (most recent call last):
    File "", line 1, in
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 16: ordinal not in range(128)

    >>> a.encode(errors='ignore')
    'Some fancy text something'

    ReplyDelete
  2. As Igor said above, encode(errors='ignore') is the preferred way of doing this.

    But on a more fundamental level, try to avoid converting things to ASCII unless absolutely necessary. In your case, where you're writing to a CSV file, the actual problem you have is that the Python "csv" library doesn't support Unicode out of the box. However, it's easy enough to make it output UTF-8-ecoded CSV files. The Python docs even give you the code you need to do it, see the bottom of: http://docs.python.org/library/csv.html#examples

    If it is *absolutely* necessary to convert to ASCII, consider using the unidecode library ( http://pypi.python.org/pypi/Unidecode/ ), which will convert things more nicely, e.g. converting 'résumé' to 'resume' instead of 'rsum', like your code or encode(errors='ignore') would.

    Some other notes:

    The Pythonic way to do this:

    i = 0
    for value in values:
    . # Your code
    . i += 1

    is this:

    for i, value in enumerate(values):
    . # Your code

    Also, you should never have an "except:" without specifying an exception type, since it will catch things you definitely don't want to catch, like KeyboardInterrupt and SystemExit.

    And finally, the functionality of what you wrote can be replicated with this one-liner:

    >>> ''.join(letter for letter in u'bl\u2013ah if ord(letter) <= 127)
    u'blh'

    ReplyDelete
  3. Wow cool... thanks guys! I'll update the article during this weekend... I've implemented it (finally) the way you tell "using the unidecode library"...

    Anyway 1 liner is nice thanks... Just maybe I'm junior, so I prefer many lines of code...

    Also many many thanks with CSV info... I've not thought, I'm using "not standard" python CSV export library in fact. ;)

    ReplyDelete