
Python / Django UnicodeEncodeError hacks.

Python handles Unicode nicely nowadays. But what if you have to deal with conversion to old formats, or export ASCII files, for example? You may also depend on software that is out of date but would take too long to rewrite... This is where errors often occur. I got mine when users copy-pasted text from MS Word into the Django admin UI. Most of the website handled these fancy characters fine, but exporting to CSV failed because the export did not support non-ASCII characters. Google turned up nothing special, and the Python docs on Unicode usage only briefly cover this kind of situation. So here is the result of a few hours of experiments. I decided to reimplement a bit of Python's functionality as a cleanup function that behaves the way I need. Hopefully it will save you some time with the same collisions in your Django apps...

Anyway, I started to receive errors like:
Exception Type:  UnicodeEncodeError
Exception Value: 'ascii' codec can't encode character u'\u2013' in position 17: ordinal not in range(128)
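To see where such an error comes from, here is a minimal sketch that reproduces it on a sample string. The helper name find_first_non_ascii is made up for this illustration, and the snippet is written so that it should run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Sample text with an en dash (U+2013), the kind of character MS Word inserts.
text = u'Some fancy text \u2013 something'


def find_first_non_ascii(value):
    """Return (index, character) of the first non-ASCII character, or None.

    Hypothetical helper, not part of the original app.
    """
    try:
        value.encode('ascii')
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that failed to encode
        return exc.start, value[exc.start]
    return None


print(find_first_non_ascii(text))
```

The index it reports is the same "position" number that appears in the exception message.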
So I had a list of u'' values, some of which contained special characters with "ordinal not in range(128)". The fix requires no imports... Pure Python:
        values = [
          u'Some fancy text \u2013 something',
          u'some normal, easy convertible text',
          u'some more normal text'
          ]
        # HACK: entry cleanup for special characters (Fixing Bug #...)
        i = 0
        for value in values:
            try:
                # if the string can be encoded to 'ascii', leave it as is
                unicode(value).encode('ascii')
            except UnicodeEncodeError:
                val_temp = unicode(value)
                # clean up the string by dropping non-convertible characters
                result = []
                for symbol in val_temp:
                    try:
                        symbol.encode('ascii')
                        result.append(symbol)
                    except UnicodeEncodeError:
                        pass
                # overwrite the bad value in the values list
                values[i] = u''.join(result)
            i = i+1
        # work with our list as usual... it's safe now...
        # values == [
        #   u'Some fancy text  something',
        #   u'some normal, easy convertible text',
        #   u'some more normal text'
        #   ]
This code is a bit complicated because of my specific task, with loops inside loops etc... But it comes from a working app and is verified to work. Here is a simpler, theoretical example that cleans up a single string:
value = u'Some fancy text \u2013 something'
try:
    # if the string can be encoded to 'ascii', leave it as is
    value.encode('ascii')
except UnicodeEncodeError:
    # clean up the string by dropping non-convertible characters
    result = []
    for symbol in value:
        try:
            symbol.encode('ascii')
            result.append(symbol)
        except UnicodeEncodeError:
            pass
    # overwrite our variable with the safe one
    value = u''.join(result)
# work with our unicode string as usual... it's safe now...
# value == u'Some fancy text  something'
So the technique is simple. We check whether the unicode string can be encoded to Python's 'ascii' encoding without errors; if it can, we simply pass it through. And if it can't... we go through it symbol by symbol, and the symbols that fail to encode are gracefully omitted. You can wrap all of this in a function, like 'my_decode_cleanup' or something, and use it whenever needed...
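
A minimal sketch of such a function could look like the following. The name my_decode_cleanup is the one suggested above; cleanup_all is a made-up helper for the list case. It assumes every value is already a text string (there is no unicode() cast here), which also lets it run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-


def my_decode_cleanup(value):
    """Return `value` with every non-ASCII character dropped.

    Assumes `value` is a text (unicode) string.
    """
    try:
        # Fast path: the string is already plain ASCII
        value.encode('ascii')
        return value
    except UnicodeEncodeError:
        # Slow path: keep only characters whose code point fits in ASCII
        return u''.join(symbol for symbol in value if ord(symbol) < 128)


def cleanup_all(values):
    """Hypothetical helper: clean every string in a list, returning a new list."""
    return [my_decode_cleanup(value) for value in values]


values = [
    u'Some fancy text \u2013 something',
    u'some normal, easy convertible text',
]
print(cleanup_all(values))
```

Returning a new list instead of rewriting values[i] in place avoids the manual counter, at the cost of a second list in memory.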

Hope this helps you save some precious time during your Python development.

Helped? Am I wrong somewhere? Please comment!

Comments

  1. Hi,

    .encode() method could handle this for you:

    >>> a
    u'Some fancy text \u2013 something'

    >>> a.encode()
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 16: ordinal not in range(128)

    >>> a.encode(errors='ignore')
    'Some fancy text  something'

  2. As Igor said above, encode(errors='ignore') is the preferred way of doing this.

    But on a more fundamental level, try to avoid converting things to ASCII unless absolutely necessary. In your case, where you're writing to a CSV file, the actual problem you have is that the Python "csv" library doesn't support Unicode out of the box. However, it's easy enough to make it output UTF-8-encoded CSV files. The Python docs even give you the code you need to do it, see the bottom of: http://docs.python.org/library/csv.html#examples

    If it is *absolutely* necessary to convert to ASCII, consider using the unidecode library ( http://pypi.python.org/pypi/Unidecode/ ), which will convert things more nicely, e.g. converting 'résumé' to 'resume' instead of 'rsum', like your code or encode(errors='ignore') would.

    Some other notes:

    The Pythonic way to do this:

    i = 0
    for value in values:
        # Your code
        i += 1

    is this:

    for i, value in enumerate(values):
        # Your code

    Also, you should never have an "except:" without specifying an exception type, since it will catch things you definitely don't want to catch, like KeyboardInterrupt and SystemExit.

    And finally, the functionality of what you wrote can be replicated with this one-liner:

    >>> ''.join(letter for letter in u'bl\u2013ah' if ord(letter) <= 127)
    u'blah'

  3. Wow cool... thanks guys! I'll update the article during this weekend... I've implemented it (finally) the way you tell "using the unidecode library"...

    Anyway 1 liner is nice thanks... Just maybe I'm junior, so I prefer many lines of code...

    Also many many thanks with CSV info... I've not thought, I'm using "not standard" python CSV export library in fact. ;)
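
To compare the options raised in the comments, here is a small sketch that puts the 'ignore' and 'replace' error handlers of encode() next to a standard-library approximation of accent folding. Note that this is not the unidecode library: it swaps in unicodedata.normalize('NFKD', ...), which only splits accented letters into a base letter plus combining marks, so characters without such a decomposition (the en dash, for example) are still dropped. The function names and the sample string are made up for the illustration:

```python
# -*- coding: utf-8 -*-
import unicodedata

# Made-up sample: two accented letters and an en dash (U+2013)
sample = u'r\xe9sum\xe9 \u2013 draft'


def ascii_ignore(text):
    """Drop every character that ASCII cannot represent."""
    return text.encode('ascii', 'ignore').decode('ascii')


def ascii_replace(text):
    """Replace every non-ASCII character with '?'."""
    return text.encode('ascii', 'replace').decode('ascii')


def ascii_fold(text):
    """Split accented letters into base letter + mark, then drop the rest."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


for func in (ascii_ignore, ascii_replace, ascii_fold):
    print(func.__name__, repr(func(sample)))
```

With 'replace' the positions of the lost characters stay visible, which can be handy when checking exported data by eye.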
