Convert File To Ascii Is Throwing Exceptions
Solution 1:
Use:
contentOfFile.decode('utf-8', 'ignore')
The exception is from the decode phase, where you didn't ignore the error.
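To see the difference between the default strict decoding and the ignore handler, here is a minimal sketch. It assumes Python 3, and the sample bytes are a made-up stand-in for contentOfFile, not the asker's actual data:

```python
# Hypothetical sample: "Löwis" stored as ISO-8859-1 bytes (0xf6 is not valid UTF-8).
raw = b"L\xf6wis"

# Strict decoding (the default) raises on the first invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the 'ignore' handler the undecodable byte is silently dropped.
text = raw.decode("utf-8", "ignore")

print(strict_failed)  # True
print(text)           # Lwis
```

Note that the dropped byte leaves no trace in the result, so the data loss is invisible afterwards.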
Solution 2:
0x00f6 is ö (ouml) encoded in ISO-8859-1. My guess is you're using the wrong Unicode decoder.
Try:
unicodeData = contentOfFile.decode("ISO-8859-1")
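As a quick check of that guess, the sketch below decodes a made-up byte string containing 0xf6 (Python 3 assumed; the sample bytes are hypothetical). Keep in mind that ISO-8859-1 maps every possible byte to a character, so a decode that succeeds does not by itself prove the file really is ISO-8859-1:

```python
# Hypothetical sample standing in for contentOfFile.
raw = b"L\xf6wis"

unicode_data = raw.decode("ISO-8859-1")

# Byte 0xf6 maps straight to code point U+00F6 (o with diaeresis).
code_point = ord(unicode_data[1])

print(unicode_data == "L\u00f6wis")  # True
print(hex(code_point))               # 0xf6
```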
Solution 3:
You don't need to load the whole file in memory and call .decode() on it. open() has an encoding parameter (use io.open() on Python 2):
with open(filename, encoding='ascii', errors='ignore') as file:
    ascii_char = file.read(1)
If you need an ASCII transliteration of Unicode text, consider unidecode.
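A self-contained sketch of the same idea, reading in small chunks rather than all at once. It assumes Python 3; the temporary file and its contents are invented purely so the example runs on its own:

```python
import os
import tempfile

# Hypothetical sample file: "Löwis" plus a newline, written as ISO-8859-1 bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"L\xf6wis\n")
    filename = tmp.name

chunks = []
with open(filename, encoding="ascii", errors="ignore") as file:
    # read(4) returns "" at end of file, which stops the iterator.
    for chunk in iter(lambda: file.read(4), ""):
        chunks.append(chunk)

os.remove(filename)

ascii_text = "".join(chunks)
print(repr(ascii_text))  # 'Lwis\n'
```

The chunk size of 4 is arbitrary here; for real files a larger size, or simply iterating over the file line by line, would likely be more practical.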
Solution 4:
I really don't care about data loss converting them to ASCII. ... How can I convert such files to contain only the pure Ascii characters?
One way is to use the replace option of the decode method. The advantage of replace over ignore is that you get placeholders for missing values, which may help prevent a misinterpretation of the text.
Be sure to use ASCII encoding rather than UTF-8. Otherwise, you may lose adjacent ASCII characters as the decoder attempts to re-sync.
Lastly, run encode('ascii') after the decoding step. Otherwise, you're left with a unicode string instead of a byte string.
>>> string_of_unknown_encoding = 'L\u00f6wis'.encode('latin-1')
>>> now_in_unicode = string_of_unknown_encoding.decode('ascii', 'replace')
>>> back_to_bytes = now_in_unicode.replace('\ufffd', '?').encode('ascii')
>>> type(back_to_bytes)
<class 'bytes'>
>>> print(back_to_bytes)
b'L?wis'
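One practical side effect of the placeholders is that they can be counted, which gives a rough measure of how much was lost. A small sketch, assuming Python 3 and a made-up input with two non-ASCII bytes:

```python
# Hypothetical input: two bytes (0xf6 and 0xe9) fall outside the ASCII range.
raw = b"L\xf6wis and caf\xe9"

ignored = raw.decode("ascii", "ignore")
replaced = raw.decode("ascii", "replace")

# The ASCII decoder emits one U+FFFD per undecodable byte.
lost = replaced.count("\ufffd")
visible = replaced.replace("\ufffd", "?")

print(ignored)  # Lwis and caf
print(visible)  # L?wis and caf?
print(lost)     # 2
```

With ignore, the output gives no hint that anything is missing; with replace, each gap stays visible.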
That said, TheRightWay™ to do this is to start caring about data loss and use the correct encoding (clearly your input isn't in UTF-8 otherwise the decoding wouldn't have failed):
>>> string_of_known_latin1_encoding = 'L\u00f6wis'.encode('latin-1')
>>> now_in_unicode = string_of_known_latin1_encoding.decode('latin-1')
>>> back_to_bytes = now_in_unicode.encode('ascii', 'replace')
>>> type(back_to_bytes)
<class 'bytes'>
>>> print(back_to_bytes)
b'L?wis'