Parsing Invalid Unicode Json In Python
Solution 1:
If the rest of the string apart from invalid \x5c
is a JSON then you could use string-escape
encoding to decode `'\x5c into backslashes:
>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'>>> json.loads(s.decode('string-escape'))
{u'test': {u'foo': u'Ig0s/k/4jRk'}}
Solution 2:
You don't have JSON; that can be interpreted directly as Python instead. Use ast.literal_eval()
:
>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}
The \x5C
is a single backslash, doubled in the Python literal string representation here. The actual string value is:
>>>print _['test']['foo']
Ig0s\/k\/4jRk
This parses the input as Python source, but only allows for literal values; strings, None
, True
, False
, numbers and containers (lists, tuples, dictionaries).
This method is slower than json.loads()
because it does part of the parse-tree processing in pure Python code.
Another approach would be to use a regular expression to replace the \xhh
escape codes with JSON \uhhhh
codes:
import re
escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')
defrepair(string):
return escape_sequence.sub(r'\\u00\1', string)
Demo:
>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}
If you can repair the source producing this value to output actual JSON instead that'd be a much better solution.
Post a Comment for "Parsing Invalid Unicode Json In Python"