Skip to content Skip to sidebar Skip to footer

Parsing Invalid Unicode Json In Python

i have a problematic json string contains some funky unicode characters 'test':{'foo':'Ig0s\x5C/k\x5C/4jRk'}} and if I convert using python import json s = r'{'test':{'foo':'Ig0s\

Solution 1:

If the rest of the string apart from invalid \x5c is a JSON then you could use string-escape encoding to decode `'\x5c into backslashes:

>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'>>> json.loads(s.decode('string-escape')) 
{u'test': {u'foo': u'Ig0s/k/4jRk'}}

Solution 2:

You don't have JSON; that can be interpreted directly as Python instead. Use ast.literal_eval():

>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}

The \x5C is a single backslash, doubled in the Python literal string representation here. The actual string value is:

>>>print _['test']['foo']
Ig0s\/k\/4jRk

This parses the input as Python source, but only allows for literal values; strings, None, True, False, numbers and containers (lists, tuples, dictionaries).

This method is slower than json.loads() because it does part of the parse-tree processing in pure Python code.

Another approach would be to use a regular expression to replace the \xhh escape codes with JSON \uhhhh codes:

import re

escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')

defrepair(string):
    return escape_sequence.sub(r'\\u00\1', string)

Demo:

>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}

If you can repair the source producing this value to output actual JSON instead that'd be a much better solution.

Post a Comment for "Parsing Invalid Unicode Json In Python"