
Python Get Character Code In Different Encoding?

Given a character code as an integer in one encoding, how can you get that character's code in, say, utf-8, again as an integer?

Solution 1:

UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a character, decode it, and use ord() to get the code point.

>>> ord(chr(145).decode('koi8-r'))
9618
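
The session above is Python 2. As a rough Python 3 sketch of the same idea, assuming the input is a single byte value in a single-byte encoding, `bytes([code])` takes the place of `chr()`. The helper name `code_point_from_byte` is made up for illustration:

```python
def code_point_from_byte(code, encoding):
    """Return the Unicode code point for a byte value in a single-byte encoding.

    Hypothetical helper, not part of the original answer.
    """
    # bytes([code]) builds a one-byte string; it raises ValueError outside 0-255.
    return ord(bytes([code]).decode(encoding))


print(code_point_from_byte(145, 'koi8-r'))  # 9618, matching the session above
```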

Solution 2:

You can only map an "integer number" from one encoding to another if they are both single-byte encodings.

Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):

>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('cp1252')
'\x80'
>>> ord(s.encode('cp1252'))
128
>>> ord(s.encode('iso-8859-15'))
164

Note that ord is used here to get the ordinal value of the encoded byte. Using ord on the original unicode string gives its Unicode code point instead:

>>> ord(s)
8364

The reverse operation to ord can be done using either chr (for codes in the range 0 to 255, which returns a one-byte string) or unichr (for codes in the range 0 to sys.maxunicode, which returns a unicode string):

>>> print chr(65)
A
>>> print unichr(8364)
€
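
The single-byte mapping described above can be wrapped in a small function. This is a minimal Python 3 sketch rather than the Python 2 used in the sessions; `recode_byte` is a hypothetical name, and it assumes both encodings are single-byte:

```python
def recode_byte(code, source, target):
    """Map a byte value from one single-byte encoding to another.

    Hypothetical helper for illustration. Raises UnicodeError if the
    character cannot be represented in the target encoding.
    """
    char = bytes([code]).decode(source)   # byte value -> character
    encoded = char.encode(target)         # character -> bytes in the target
    if len(encoded) != 1:
        raise ValueError('target encoding needs more than one byte')
    return encoded[0]                     # indexing bytes yields an int in Python 3


print(recode_byte(164, 'iso-8859-15', 'cp1252'))  # 128
print(recode_byte(128, 'cp1252', 'iso-8859-15'))  # 164
```

The length check is there so that a multi-byte target such as "utf-8" fails loudly for non-ASCII characters instead of silently returning only the first byte.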

For multi-byte encodings, a simple "integer number" mapping is usually not possible.

Here's the same example as above, but using "iso-8859-15" and "utf-8":

>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('utf-8')
'\xe2\x82\xac'
>>> [ord(c) for c in s.encode('iso-8859-15')]
[164]
>>> [ord(c) for c in s.encode('utf-8')]
[226, 130, 172]

The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
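
If a list of byte values is acceptable as the "code", the comparison can be sketched in Python 3 as below. The name `encoded_values` is hypothetical, and packing the bytes into one big-endian integer is only one possible convention, not a standard notion of character code:

```python
def encoded_values(code_point, encoding):
    """Return the byte values of one character in the given encoding."""
    return list(chr(code_point).encode(encoding))


euro = 8364  # Unicode code point of the euro sign
print(encoded_values(euro, 'iso-8859-15'))  # [164]
print(encoded_values(euro, 'utf-8'))        # [226, 130, 172]

# Assumed convention: read the encoded bytes as a single big-endian integer.
print(int.from_bytes(chr(euro).encode('utf-8'), 'big'))  # 14844588

# ASCII characters keep the same single value in ASCII-compatible encodings.
print(encoded_values(65, 'utf-8'))  # [65]
```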

Solution 3:

Here's an example of how the encode/decode dance works:

>>> s = b'd\x06'             # perhaps start with bytes encoded in utf-16
>>> map(ord, s)              # show those bytes as integers
[100, 6]
>>> u = s.decode('utf-16')   # turn the bytes into unicode
>>> print u                  # show what the character looks like
٤
>>> print ord(u)             # show the unicode code point as an integer
1636
>>> t = u.encode('utf-8')    # turn the unicode into bytes with a different encoding
>>> map(ord, t)              # show that encoding as integers
[217, 164]

Hope this helps :-)

If you need to construct the unicode directly from an integer, use unichr:

>>> u = unichr(1636)
>>> print u
٤
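
For comparison, here is a rough Python 3 version of the same encode/decode dance. It is a sketch, not the original answer's code: it spells out 'utf-16-le' on the assumption that the two bytes are little-endian, since plain 'utf-16' without a byte order mark falls back to the platform's native byte order.

```python
s = b'd\x06'                 # two bytes, taken here as little-endian UTF-16
print(list(s))               # [100, 6]; iterating bytes yields integers
u = s.decode('utf-16-le')    # bytes -> str
print(ord(u))                # 1636
t = u.encode('utf-8')        # str -> bytes in a different encoding
print(list(t))               # [217, 164]
print(chr(1636) == u)        # True; chr() replaces unichr() in Python 3
```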
