Python3 Qt Unicode File Name Problems
Solution 1:
You're right, 123c is just wrong. The evidence shows that the filename on disk decodes to the lone surrogate code point U+DCB4, which is not a valid character on its own. When Python tries to print that character, it rightly complains that it can't. When Qt processes the character in test4 it can't handle it either, but instead of throwing an error it converts it to the Unicode REPLACEMENT CHARACTER U+FFFD. Obviously the new filename no longer matches what's on disk.
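Python can also use the replacement character in a string instead of throwing an error if you do the conversion yourself and specify the proper error handling. The encode step needs 'surrogateescape' to turn U+DCB4 back into the original byte (a plain encode('utf-8') raises UnicodeEncodeError on a lone surrogate); the decode step then uses 'replace'. I don't have Python 3 on hand to test this but I think it will work:
filename = filename.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')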
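Since the converted name no longer matches the file on disk, it is probably safest to keep the original string for file access and use the converted one only for display. A minimal sketch of that split, where display_name is a hypothetical helper name and not part of Python or Qt:

```python
def display_name(filename):
    """Return a printable copy of a filename obtained from os.listdir().

    Assumption: the name was decoded as UTF-8 with surrogateescape,
    which is what Python 3 does on a UTF-8 locale.
    """
    # Turn escaped surrogates such as U+DCB4 back into the raw bytes.
    raw = filename.encode("utf-8", "surrogateescape")
    # Decode again, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", "replace")


original = "123c\udcb4.wav"      # keep this one for open() and os.path.exists()
shown = display_name(original)   # use this one for print() or a Qt widget
```

Names that are already valid UTF-8 pass through unchanged, so the helper can be applied to every entry in a listing.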
Solution 2:
Codes like "\udcb4" come from the surrogateescape error handler. It's a way for Python to preserve bytes that cannot be interpreted as valid UTF-8: each undecodable byte 0xNN is decoded to the lone surrogate U+DCNN. When the string is encoded back to UTF-8 with errors="surrogateescape", each such surrogate is turned back into its original byte, so "\udcb4" becomes 0xB4. Surrogate escape makes it possible to deal with any byte sequence in file names. But you need to be careful to pass errors="surrogateescape" whenever you encode such a string yourself, as documented in the Unicode HOWTO https://docs.python.org/3/howto/unicode.html
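The round trip can be sketched as follows, using the byte string from the question as sample data. The variable names are only illustrative:

```python
# Raw file name as stored on disk: 0xB4 alone is not valid UTF-8.
raw = b"123c\xb4.wav"

# Decoding with surrogateescape maps the bad byte 0xB4 to U+DCB4.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = name.encode("utf-8", "surrogateescape")

# Without the handler, the lone surrogate cannot be encoded at all.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Here name compares equal to '123c\udcb4.wav', the same string os.listdir() returns in the Python 3 session.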
Solution 3:
Python2 vs Python3
python
Python 2.7.4 (default, Sep 26 2013, 03:20:56)
>>> import os
>>> os.listdir('.')
['unicode.py', '123c\xb4.wav', '123b\xc3\x86.wav', '123a\xef\xbf\xbd.wav']
>>> os.path.exists(u'123c\xb4.wav')
False
>>> os.path.exists('123c\xb4.wav')
True
>>> n ='123c\xb4.wav'
>>> print(n)
123c�.wav
>>> n =u'123c\xb4.wav'
>>> print(n)
123c´.wav
That acute accent (´) on the last line above is what I've been looking for, rather than that �.
The same directory listed with Python3 shows a different set of filenames:
python3
Python 3.3.1 (default, Sep 25 2013, 19:30:50)
>>> import os
>>> os.listdir('.')
['unicode.py', '123c\udcb4.wav', '123bÆ.wav', '123a�.wav']
>>> os.path.exists('123c\udcb4.wav')
True
Is this a bug in Python3?
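Going by Solution 2, the Python 3 listing looks like surrogateescape at work rather than a bug: the byte 0xB4 is kept as U+DCB4 so the name still matches the file. One way to see the raw bytes, as Python 2 showed them, is to pass a bytes path to os.listdir(). A rough sketch, where list_names is a hypothetical helper and the names are assumed to be meant as UTF-8:

```python
import os


def list_names(directory="."):
    """Return (raw_bytes, display_text) pairs for the entries in directory."""
    pairs = []
    # A bytes argument makes os.listdir() return undecoded bytes names.
    for raw in os.listdir(os.fsencode(directory)):
        # The display text is lossy: undecodable bytes become U+FFFD.
        display = raw.decode("utf-8", "replace")
        pairs.append((raw, display))
    return pairs
```

The raw bytes can be passed back to os.path.exists() or open() unchanged, while the display text is only for showing to the user.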