If you’re working on any old PHP/MySQL sites, chances are at some point you’re going to need to get into the murky, painful world of character encoding – presumably to convert everything to UTF-8 from whatever original setup you have. It is not fun, but fortunately many people have gone through it before and paved the way with a collection of useful information and scripts.
One problem which struck us recently when migrating our database server was certain characters being “double encoded”. This appears to be relatively common. For us, the cause was exporting our data – all UTF-8 data but stored in tables that were latin1 – via mysqldump and then importing again as if it was UTF-8. This means something like the characters are detected as multibyte, but because the source and destinations were different, they’re re-encoded – so you end up with these double encoded characters that look like squiggly gibberish appearing in all your web pages.
Nathan over at the Blue Box Group has written an extremely comprehensive guide to problems like this. It explains the root cause of these problems, the common symptoms, and – of course, most importantly – precise details on how to safely fix them. If you’re doing anything at all involved in changing character encoding then it is worth a read even before you have problems, just so you can get a better handle on how to fix things and what your end game should be.
There’s a few other ways to fix it, of course. The Blue Box solution is comprehensive and reliable but it requires quite a bit of work to get it going, and you also need to know which database table fields you want to work on specifically – so it can be time consuming unless you’re prepared to really sit down and work on it, either to process everything manually or write a script to do it all for you.
Fortunately there’s an easier way, as described here – basically, all you need to do is export your current dataset with mysqldump, forcing it to latin1, and then re-import it as UTF-8:
mysqldump -h DB_HOST -u DB_USER -p –opt –quote-names –skip-set-charset –default-character-set=latin1 DB_NAME > DB_NAME-dump.sql
mysql -h DB_HOST -u DB_USER -p –default-character-set=utf8 DB_NAME < DB_NAME-dump.sql
We did this for AusGamers.com and it worked perfectly – the only caveat you need to be aware of is that it will mess up UTF-8 characters that are properly encoded aleady. For us this wasn’t a big deal as we were able to clearly identify them and fix them manually.
StackOverflow has yet another approach which might be suitable if you’re dealing with only one or two tables and just want to fix it from the MySQL console or phpMyAdmin or whatever – changing the table character sets on the fly:
ALTER TABLE [tableName] MODIFY [columnName] [columnType] CHARACTER SET latin1
ALTER TABLE MyTable [tableName] [columnName] [columnType] CHARACTER SET binary
ALTER TABLE MyTable [tableName] [columnName] [columnType] CHARACTER SET utf8
This method worked fine for me in a test capacity on a single table but we didn’t end up using it everywhere.