Python Get Character Code In Different Encoding?
Solution 1:
UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a one-byte string, decode it with the source encoding, and use ord() to get the code point.
>>> ord(chr(145).decode('koi8-r'))
9618
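The snippet above is Python 2. As a rough sketch of the same idea in Python 3 (where chr() returns text rather than a byte), one could build the byte explicitly with bytes([...]). The helper name code_point_from_byte is made up for illustration and is not part of the original answer.

```python
def code_point_from_byte(code, encoding):
    """Return the Unicode code point for a single-byte character code.

    Hypothetical helper: assumes `encoding` is a single-byte encoding
    and `code` is an integer in the range 0-255.
    """
    # bytes([code]) builds a one-byte bytes object; decode() turns it into a str
    character = bytes([code]).decode(encoding)
    # ord() on a one-character str gives its Unicode code point
    return ord(character)


print(code_point_from_byte(145, 'koi8-r'))
```

Note that decode() raises UnicodeDecodeError if the byte is not defined in the chosen encoding.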
Solution 2:
You can only map an "integer number" from one encoding to another if they are both single-byte encodings.
Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):
>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('cp1252')
'\x80'
>>> ord(s.encode('cp1252'))
128
>>> ord(s.encode('iso-8859-15'))
164
Note that ord is here being used to get the ordinal number of the encoded byte. Using ord on the original unicode string would give its unicode code point:
>>> ord(s)
8364
The reverse operation to ord can be done using either chr (for codes in the range 0 to 255, which gives a one-byte string; only 0 to 127 are ASCII) or unichr (for codes in the range 0 to sys.maxunicode, which gives a unicode string):
>>> print chr(65)
A
>>> print unichr(8364)
€
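For readers on Python 3, where unichr no longer exists, a minimal sketch of the same reverse operation might look like this. The variable names are illustrative only.

```python
import sys

# In Python 3, chr() covers the whole Unicode range (0 to sys.maxunicode)
letter = chr(65)
euro = chr(8364)

# To get a single raw byte instead of text, build a bytes object from the code
raw_byte = bytes([164])

print(letter, euro, raw_byte, hex(sys.maxunicode))
```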
For multi-byte encodings, a simple "integer number" mapping is usually not possible.
Here's the same example as above, but using "iso-8859-15" and "utf-8":
>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('utf-8')
'\xe2\x82\xac'
>>> [ord(c) for c in s.encode('iso-8859-15')]
[164]
>>> [ord(c) for c in s.encode('utf-8')]
[226, 130, 172]
The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
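To make the mapping concrete, here is a small Python 3 sketch that takes a character code in one single-byte encoding and returns the byte values of the same character in another encoding. The function name recode is hypothetical, and the result is a list precisely because the target encoding may need more than one byte.

```python
def recode(code, source, target):
    """Map a single-byte character code from `source` to `target`.

    Hypothetical helper: returns a list of integers, one per byte
    that the character occupies in the target encoding.
    """
    # Decode the single byte into a text character first
    character = bytes([code]).decode(source)
    # Re-encode in the target encoding; iterating bytes yields integers
    return list(character.encode(target))


print(recode(164, 'iso-8859-15', 'cp1252'))
print(recode(164, 'iso-8859-15', 'utf-8'))
print(recode(65, 'iso-8859-15', 'utf-8'))
```

A one-element result means a simple integer-to-integer mapping exists for that character; a longer result means it does not.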
Solution 3:
Here's an example of how the encode/decode dance works:
>>> s = b'd\x06'  # perhaps start with bytes encoded in utf-16
>>> map(ord, s)  # show those bytes as integers
[100, 6]
>>> u = s.decode('utf-16')  # turn the bytes into unicode
>>> print u  # show what the character looks like
٤
>>> print ord(u)  # show the unicode code point as an integer
1636
>>> t = u.encode('utf-8')  # turn the unicode into bytes with a different encoding
>>> map(ord, t)  # show that encoding as integers
[217, 164]
Hope this helps :-)
If you need to construct the unicode directly from an integer, use unichr:
>>> u = unichr(1636)
>>> print u
٤
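For comparison, a sketch of the same round trip in Python 3 could look like the following. It assumes the input bytes are little-endian and therefore names 'utf-16-le' explicitly, since plain 'utf-16' without a byte order mark falls back to the platform's byte order.

```python
raw = b'd\x06'                      # bytes assumed to be UTF-16, little-endian
raw_values = list(raw)              # iterating bytes yields integers in Python 3

text = raw.decode('utf-16-le')      # turn the bytes into a str
code_point = ord(text)              # Unicode code point as an integer

utf8_bytes = text.encode('utf-8')   # same character, different encoding
utf8_values = list(utf8_bytes)

rebuilt = chr(code_point)           # construct the character from the integer

print(raw_values, code_point, utf8_values, rebuilt == text)
```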