Skip to content Skip to sidebar Skip to footer

SQLAlchemy: Dealing With CP-1252 Data When Python Is Expecting It To Be UTF-8

I am working with an existing SQLite database and experiencing errors due to the data being encoded in CP-1252, when Python is expecting it to be UTF-8. >>> import sqlite3

Solution 1:

SQLAlchemy and SQLite are behaving normally. The solution is to fix the non-UTF-8 data in the database.

I wrote the below, drawing inspiration from https://stackoverflow.com/a/2395414/1191425 . It:

  • loads up the target SQLite database
  • lists all columns in all tables
  • if the column is a text, char, or clob type - including variants like varchar and longtext - it re-encodes the data from the INPUT_ENCODING to UTF-8.

INPUT_ENCODING = 'cp1252' # The encoding you want to convert from
import sqlite3
db = sqlite3.connect('dnd_fixed.sqlite')
db.create_function('FIXENCODING', 1, lambda s: str(s).decode(INPUT_ENCODING))
cur = db.cursor()
tables = cur.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
tables = [t[0] for t in tables]
for table in tables:
    columns = cur.execute('PRAGMA table_info(%s)' % table ).fetchall() # Note: pragma arguments can't be parameterized.
    for column_id, column_name, column_type, nullable, default_value, primary_key in columns:
        if ('char' in column_type) or ('text' in column_type) or ('clob' in column_type):
            # Table names and column names can't be parameterized either.
            db.execute('UPDATE "{0}" SET "{1}" = FIXENCODING(CAST("{1}" AS BLOB))'.format(table, column_name))

After this script runs, all *text*, *char*, and *clob* fields are in UTF-8 and no more Unicode decoding errors will occur. I can now Faerûn to my heart's content.


Solution 2:

If you have a connection URI then you can add the following options to your DB connection URI:

DB_CONNECTION = mysql+pymysql://{username}:{password}@{host}/{db_name}?{options}
DB_OPTIONS = {
    "charset": "cp-1252",
    "use_unicode": 1,
}
connection_uri = DB_CONNECTION.format(
    username=???,
    ...,
    options=urllib.urlencode(DB_OPTIONS)        
)

Assuming your SQLLite driver can handle those options (pymysql can, but I don't know 100% about sqllite), then your queries will return unicode strings.


Post a Comment for "SQLAlchemy: Dealing With CP-1252 Data When Python Is Expecting It To Be UTF-8"