Converting Unicode Code Point Numbers To Unicode Characters

I'm using the argparse library in Python 3 to read in Unicode strings from a command line parameter. Often those strings contain 'ordinary' Unicode characters (extended Latin, etc

Solution 1:

What you may want to look at is the raw_unicode_escape encoding.

>>> len(b'\\uffff')
6
>>> b'\\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\\uffff'.decode('raw_unicode_escape'))
1

So, the function would be:

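def ParseString2Unicode(sInString):
    try:
        # Encode with raw_unicode_escape rather than UTF-8: code points below
        # 256 become single bytes, so a character such as 'é' is not turned
        # into mojibake when the bytes are decoded again.
        raw_bytes = sInString.encode('raw_unicode_escape')
        return raw_bytes.decode('raw_unicode_escape')
    except UnicodeError:
        # Malformed escape (e.g. a truncated \u12): return the input unchanged.
        return sInString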
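
Since the question reads its strings through argparse, one way to wire this in is the `type=` argument of `add_argument`, which calls a function on each raw value. The following is a minimal sketch under stated assumptions: `unescape_arg`, `build_parser` and the `--text` option are hypothetical names chosen for illustration, and the helper encodes with raw_unicode_escape (not UTF-8) so that non-ASCII characters already present in the argument are not mangled.

```python
import argparse


def unescape_arg(value):
    # Hypothetical helper for this sketch; it is not part of argparse.
    try:
        return value.encode('raw_unicode_escape').decode('raw_unicode_escape')
    except UnicodeError:
        # Malformed escape sequence: keep the argument exactly as typed.
        return value


def build_parser():
    parser = argparse.ArgumentParser()
    # "--text" is an illustrative option name, not taken from the question.
    parser.add_argument('--text', type=unescape_arg, default='')
    return parser


# An explicit argument list lets the example run without a real command line.
args = build_parser().parse_args(['--text', r'caf\u00e9'])
print(args.text)
```

With this arrangement the rest of the program only ever sees the converted string, so the conversion does not have to be repeated wherever the argument is used.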

This, however, also converts other Unicode escape sequences, such as \Uxxxxxxxx. If you only want to convert \uxxxx, use a regex, like so:

import re

escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')

def _escape_sequence_to_char(match):
    # Drop the leading "\u" and read the four hex digits as a code point.
    return chr(int(match[0][2:], 16))

def ParseString2Unicode(sInString):
    return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)
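
To make the difference from the codec approach concrete, here is a small self-contained sketch of the regex variant applied to three kinds of input. The names `_U4_ESCAPE` and `replace_u4_escapes` are hypothetical and used only here; the pattern is the same four-hex-digit pattern as above, with a capture group added for the digits.

```python
import re

# Hypothetical names for this sketch; only literal \uXXXX sequences match.
_U4_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def replace_u4_escapes(text):
    # Replace each matched escape with the character its hex digits name.
    return _U4_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(replace_u4_escapes(r'caf\u00e9'))    # the 4-digit escape is converted
print(replace_u4_escapes(r'\U0001F600'))   # the 8-digit escape is left as typed
print(replace_u4_escapes('naïve'))         # existing non-ASCII text is untouched
```

Because the substitution works on the string directly, there is no encode/decode round trip, which is why characters that are already non-ASCII pass through unchanged.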
