Converting Unicode Code Point Numbers To Unicode Characters
I'm using the argparse library in Python 3 to read in Unicode strings from a command line parameter. Often those strings contain 'ordinary' Unicode characters (extended Latin, etc
Solution 1:
What you may want to look at is the raw_unicode_escape encoding.
>>> len(b'\\uffff')
6
>>> b'\\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\\uffff'.decode('raw_unicode_escape'))
1
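As a rough illustration of how the codec behaves on longer input (the sample strings here are made up for the example, not taken from the question): escapes embedded in ordinary text are decoded in place, the eight-digit form is accepted too, and any byte that is not part of an escape is read as Latin-1.

```python
# Escapes embedded in plain ASCII text are decoded in place.
mixed = b'caf\\u00e9 \\u2603'.decode('raw_unicode_escape')
assert mixed == 'caf\u00e9 \u2603'          # e-acute and a snowman

# The eight-digit \UXXXXXXXX form is decoded as well.
astral = b'\\U0001F600'.decode('raw_unicode_escape')
assert len(astral) == 1

# Bytes outside an escape are treated as Latin-1, so UTF-8 bytes for a
# non-ASCII character come back as two wrong characters.
garbled = '\u00e9'.encode('utf-8').decode('raw_unicode_escape')
assert garbled == '\u00c3\u00a9'

# Encoding with the same codec keeps such characters intact.
intact = '\u00e9 \\u2603'.encode('raw_unicode_escape').decode('raw_unicode_escape')
assert intact == '\u00e9 \u2603'
```

The last two cases matter if the command-line string already contains extended Latin characters alongside the escapes.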
So, the function would be:
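def ParseString2Unicode(sInString):
    try:
        # raw_unicode_escape reads non-escape bytes as Latin-1, so encode with
        # the same codec; encoding as UTF-8 would garble characters such as 'é'.
        encoded = sInString.encode('raw_unicode_escape')
        return encoded.decode('raw_unicode_escape')
    except UnicodeError:
        # Malformed escape (for example a truncated \uXX): return the input unchanged.
        return sInString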
This, however, also matches other Unicode escape sequences, like \Uxxxxxxxx. If you just want to match \uxxxx, use a regex, like so:
import re

escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')

def _escape_sequence_to_char(match):
    # Drop the leading '\u' and convert the four hex digits to a character.
    return chr(int(match[0][2:], 16))

def ParseString2Unicode(sInString):
    return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)
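Since the question is about argparse, one way to wire this in is to pass the converter as the type= callable, so the argument arrives already converted. This is only a sketch: parse_unicode_arg and the positional argument name text are hypothetical names chosen for the example, and the explicit list passed to parse_args stands in for the real command line.

```python
import argparse
import re

# A backslash, a lowercase 'u', and exactly four hex digits.
_ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')

def parse_unicode_arg(text):
    """Hypothetical argparse type: expand \\uXXXX escapes, leave the rest alone."""
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(0)[2:], 16)), text)

parser = argparse.ArgumentParser()
# 'text' is an illustrative argument name, not one from the question.
parser.add_argument('text', type=parse_unicode_arg)

# An explicit list stands in for sys.argv[1:].
args = parser.parse_args([r'caf\u00e9 \U0001F600'])

# The four-digit escape is converted; the eight-digit one is left as typed.
assert args.text == 'caf\u00e9 \\U0001F600'
```

Because the pattern is case-sensitive and anchored on a lowercase u, the \Uxxxxxxxx form passes through untouched, which is the difference from the raw_unicode_escape approach.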