
How To Decode A Non Unicode Character In Python?

I have a string, say s = 'Chocolate Moelleux-M\xe8re'. When I do:

In [14]: unicode(s)
---------------------------------------------------------------------------
UnicodeDecode

Solution 1:

I have run into this problem one too many times. In my case the input contained strings in several different encodings, so I wrote a function that decodes a string heuristically, based on distinguishing features of each encoding.

# Python 2 code: it relies on the built-in unicode type.
import re
import sys


def decode_heuristically(string, enc=None, denc=sys.getdefaultencoding()):
    """
    Try to interpret 'string' using several possible encodings.
    @input : string, encode type.
    @output: a tuple (decoded_string, flag_decoded, encoding)
    """
    if isinstance(string, unicode):
        return string, 0, "utf-8"
    try:
        new_string = unicode(string, "ascii")
        return new_string, 0, "ascii"
    except UnicodeError:
        encodings = ["utf-8", "iso-8859-1", "cp1252", "iso-8859-15"]

        # Try the caller's encoding first, then the default, then the rest.
        if denc != "ascii":
            encodings.insert(0, denc)

        if enc:
            encodings.insert(0, enc)

        for enc in encodings:
            # Bytes 0x80-0x9f are control characters in ISO-8859-1/-15,
            # so their presence points to another encoding such as cp1252.
            if (enc in ("iso-8859-15", "iso-8859-1") and
                    re.search(r"[\x80-\x9f]", string) is not None):
                continue
            # These bytes differ between ISO-8859-1 and ISO-8859-15
            # (0xa4 is the euro sign in -15); if present, prefer -15.
            if (enc in ("iso-8859-1", "cp1252") and
                    re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", string)
                    is not None):
                continue
            try:
                new_string = unicode(string, enc)
            except UnicodeError:
                pass
            else:
                # Accept only if encoding back reproduces the input bytes.
                if new_string.encode(enc) == string:
                    return new_string, 0, enc

        # Nothing decoded cleanly: force decoding by ignoring undecodable
        # bytes, and keep the longest result. The flag 1 marks this case.
        output = [(unicode(string, enc, "ignore"), enc) for enc in encodings]
        output = [(len(new_string[0]), new_string) for new_string in output]
        output.sort()
        new_string, enc = output[-1][1]
        return new_string, 1, enc
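
The byte-range checks in the loop are needed because trial decoding alone cannot separate the single-byte encodings: ISO-8859-1 maps every byte to some code point, so it never raises. The following is a minimal sketch of that point, not part of the original answer. The sample bytes in data are an assumption made up for illustration, and the b"" prefix is used so the snippet runs on both Python 2 and Python 3.

```python
# Hypothetical sample: bytes as they might arrive from a Windows-1252 source,
# where 0x80 is the euro sign.
data = b"Price: 5 \x80"

# ISO-8859-1 never raises: it maps every byte to a code point, even 0x80,
# which becomes an invisible C1 control character.
as_latin1 = data.decode("iso-8859-1")

# cp1252 assigns a printable character (the euro sign) to the same byte.
as_cp1252 = data.decode("cp1252")

# The round-trip check alone cannot tell the two apart: both pass.
latin1_roundtrips = as_latin1.encode("iso-8859-1") == data
cp1252_roundtrips = as_cp1252.encode("cp1252") == data

print(repr(as_latin1[-1]), repr(as_cp1252[-1]))
print(latin1_roundtrips, cp1252_roundtrips)
```

Since both candidates decode and round-trip, the only remaining signal is which bytes occur in the input, which is what the two regular expressions test.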

In addition, this link gives good background on why encodings matter: Why we need sys.setdefaultencoding in py script

Solution 2:

You need to tell s.decode which encoding to use. In your case s.decode('latin-1') seems fitting, since the byte \xe8 is "è" in Latin-1.
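
As a minimal sketch of this suggestion, the snippet below decodes the string from the question twice. It assumes the bytes really are Latin-1, which the answer only describes as likely. The ASCII attempt stands in for unicode(s), which on Python 2 uses the default codec (normally ASCII). The b"" prefix keeps the snippet valid on Python 2 and Python 3; the names ascii_failed and text are introduced here for illustration.

```python
# The byte string from the question.
s = b"Chocolate Moelleux-M\xe8re"

# Decoding as ASCII fails, because 0xe8 is outside the 7-bit ASCII range.
try:
    s.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the encoding explicitly succeeds: in Latin-1, 0xe8 is U+00E8.
text = s.decode("latin-1")

print(ascii_failed)
print(text == u"Chocolate Moelleux-M\u00e8re")
```

Latin-1 maps each byte to exactly one character, so the decoded text has the same length as the input. That also means it will decode any input without error, so a successful decode does not by itself confirm that the guess was right.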
