Convert Utf-8 Unicode Sequence To Utf-8 Chars In Python 3
I'm reading data from an AWS S3 bucket which happens to have Unicode characters escaped with double backslashes. The double backslashes make the Unicode sequence parse as a series of
Solution 1:
I believe that the codecs module provides this utility:
>>> import codecs
>>> codecs.decode("1+1\\u003d2", encoding='unicode_escape')
'1+1=2'
This probably points to a larger problem, though. How do these strings come to be in the first place?
Note that if this text was extracted from a valid JSON string (the example here would be missing its surrounding quotes), you could simply use:
>>> import json
>>> json.loads('"1+1\\u003d2"')
'1+1=2'
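If the data in the bucket is a whole JSON document rather than a bare string, parsing the document directly should decode the escapes in every string value at once. A minimal sketch, assuming a made-up payload (the `raw` text and the `expr` key are hypothetical, not taken from the question):

```python
import json

# Hypothetical payload: the raw text as it might arrive from the bucket.
# In Python source, '\\u003d' is a literal backslash followed by 'u003d'.
raw = '{"expr": "1+1\\u003d2"}'

# json.loads interprets \uXXXX escapes inside JSON string values.
record = json.loads(raw)
print(record["expr"])
```

This avoids calling `unicode_escape` at all, which only makes sense when the input really is valid JSON.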
Solution 2:
I'm adding a variant of juanpa.arrivillaga's solution that also handles escaped surrogate pairs.
>>> import codecs
>>> s1 = "A surrogate sequence \\ud808\\udf45"
>>> print(codecs.decode(s1, encoding='unicode_escape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 21-22: surrogates not allowed
>>> print(codecs.decode(s1, encoding='unicode_escape', errors='surrogateescape').encode('utf-16', 'surrogatepass').decode('utf-16'))
A surrogate sequence 𒍅
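The chain above can be wrapped in a small helper. This is only a sketch: the name `decode_escapes` is made up for illustration, and it assumes the input holds only ASCII characters plus backslash escapes, since `unicode_escape` treats any other bytes as Latin-1 and would garble them.

```python
import codecs


def decode_escapes(text: str) -> str:
    """Turn literal \\uXXXX escapes into characters, joining surrogate pairs."""
    # Step 1: interpret the escapes. A pair such as \ud808\udf45 comes out
    # as two separate surrogate code points at this stage.
    decoded = codecs.decode(text, encoding="unicode_escape")
    # Step 2: round-trip through UTF-16 so each high/low pair is merged
    # into the single character it represents.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16")


result = decode_escapes("A surrogate sequence \\ud808\\udf45")
# ascii() keeps the output printable on consoles that are not UTF-8.
print(ascii(result))
```

A surrogate that has no partner cannot be merged, so the final `decode('utf-16')` step would be expected to raise `UnicodeDecodeError` for such input; callers may want to catch that.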