Fixing Double Encoded Characters in MySQL

If you’re working on any old PHP/MySQL sites, chances are at some point you’re going to need to get into the murky, painful world of character encoding – presumably to convert everything to UTF-8 from whatever original setup you have. It is not fun, but fortunately many people have gone through it before and paved the way with a collection of useful information and scripts.

One problem which struck us recently when migrating our database server was certain characters being “double encoded”. This appears to be relatively common. For us, the cause was exporting our data – all UTF-8 data, but stored in tables declared as latin1 – via mysqldump and then importing it again as if it was UTF-8. Each byte of a multibyte UTF-8 character gets treated as a separate latin1 character and re-encoded into UTF-8, so you end up with double encoded characters that look like squiggly gibberish appearing in all your web pages.
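
You can reproduce the effect in a couple of lines of PHP. A minimal sketch, just to show what happens to a single multibyte character:

$original = "é";                    // already valid UTF-8 (the two bytes 0xC3 0xA9)
$double   = utf8_encode($original); // treats those bytes as latin1 and re-encodes them to UTF-8
echo $double;                       // prints "Ã©" - the classic double encoded gibberish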

Nathan over at the Blue Box Group has written an extremely comprehensive guide to problems like this. It explains the root cause of these problems, the common symptoms, and – of course, most importantly – precise details on how to safely fix them. If you’re doing anything at all involved in changing character encoding then it is worth a read even before you have problems, just so you can get a better handle on how to fix things and what your end game should be.

There are a few other ways to fix it, of course. The Blue Box solution is comprehensive and reliable, but it requires quite a bit of work to get going, and you also need to know which database table fields you want to work on specifically – so it can be time consuming unless you’re prepared to really sit down and work on it, either to process everything manually or write a script to do it all for you.

Fortunately there’s an easier way, as described here – basically, all you need to do is export your current dataset with mysqldump, forcing it to latin1, and then re-import it as UTF-8:

mysqldump -h DB_HOST -u DB_USER -p --opt --quote-names --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

mysql -h DB_HOST -u DB_USER -p --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

We did this for AusGamers.com and it worked perfectly – the only caveat you need to be aware of is that it will mess up UTF-8 characters that are already properly encoded. For us this wasn’t a big deal as we were able to clearly identify them and fix them manually.
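
If you want a heads-up on which rows those are before you do the dump, something like this rough sketch can flag content that already contains multibyte UTF-8 so you know what to check after the re-import (the table name, column and credentials are just placeholders, and it assumes the mbstring extension is available):

$db = new PDO('mysql:host=localhost;dbname=DB_NAME', 'DB_USER', 'DB_PASS');
foreach ($db->query("SELECT id, body FROM articles") as $row) {
    // flag anything with non-ASCII bytes that currently form valid UTF-8
    if (preg_match('/[\x80-\xFF]/', $row['body']) && mb_check_encoding($row['body'], 'UTF-8')) {
        echo "row {$row['id']} has multibyte content - check it after the re-import\n";
    }
}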

StackOverflow has yet another approach which might be suitable if you’re dealing with only one or two tables and just want to fix it from the MySQL console or phpMyAdmin or whatever – changing the table character sets on the fly:

ALTER TABLE [tableName] MODIFY [columnName] [columnType] CHARACTER SET latin1
ALTER TABLE [tableName] MODIFY [columnName] [columnType] CHARACTER SET binary
ALTER TABLE [tableName] MODIFY [columnName] [columnType] CHARACTER SET utf8

This method worked fine for me in a test capacity on a single table but we didn’t end up using it everywhere.

Should I Gzip Content Before Putting it in MySQL?

The answer for us was “yes”, although there’s a lot more to it than that. I just wrote about doing this on AusGamers for a table that was causing us a lot of grief with really slow DELETEs due to the huge volume of data in there.

I found that gzip’ing the content before putting it into the database made a massive difference to performance – queries that would usually take minutes to run because they were removing up to gigabytes of data were suddenly dealing with 10x fewer bytes, which made a huge difference to the execution time.

The results were obvious – you can see in the graphs below the impact that was made.

This change might not be useful in all circumstances – at some point the CPU overhead of gzip’ing might cause more problems than it’s worth. But if you’re dealing with multi-megabyte chunks of text that MySQL only needs to store and return (i.e. you don’t need to sort by the contents or do anything else with that data from within MySQL), it’s probably worth trying.
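
The change itself is tiny. A minimal sketch, assuming a pages table with a LONGBLOB body column, a PDO connection in $db, and the page in $url/$html:

// compress on the way in
$stmt = $db->prepare("INSERT INTO pages (url, body) VALUES (?, ?)");
$stmt->execute(array($url, gzcompress($html, 6)));

// ...and uncompress on the way out
$stmt = $db->prepare("SELECT body FROM pages WHERE url = ?");
$stmt->execute(array($url));
$html = gzuncompress($stmt->fetchColumn());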

Securing WordPress Using a Separate, Privileged Apache Vhost

Something I’ve been meaning to check out for a while – locking down WordPress to make it really secure. It’s always freaked me out a bit having web server-writable directories, but it just makes WordPress so powerful and, frankly, easy to use.

I checked out the hardening guide on the official WordPress site. It has a bunch of tips about how to set file system permissions, but at the end of the day you basically need to keep certain directories world-writable if you want to have that handy functionality that lets you do things like install plugins, edit themes, and automatically update.

However, reading about a new zero-day exploit in a particular file that is packaged with many WordPress themes (not one that I happened to have installed) drove me to action, along with the realisation that basically none of those simple hardening steps is going to be useful if your site is set up with web-writable directories. If there’s an exploit in code – whether it’s core WP code or some random thing you’ve added in a plugin or theme – chances are you’ll be vulnerable.

So I have decided to try something else.

1) I’ve chowned all the files in my WordPress directory to be owned by a non-web user, but left o+rx, which means the web process can happily read and serve my files – but it can no longer write to the directory. This of course means all that functionality I mentioned above no longer works. (There’s a quick way to verify this from PHP’s point of view in the sketch after these steps.)

2) I’ve created a new Apache vhost on my VPS on a separate port. As I am running the ITK MPM – a module for Apache that allows me to specify what uid/gid the Apache process will run as on a per-vhost basis – I can tell this vhost to run as the same username as the non-web user that owns all the files.

3) I’ve made a tiny change to my wp-config.php file so that it lets me access this WordPress instance on the vhost without rewriting the URLs and forwarding me back to the main vhost. I just did something like this:


$t_port = 8958;
$t_servername = 'https://trog.qgl.org';
// if we came in on the admin port, keep that port in the URLs WordPress generates
if ($_SERVER['SERVER_PORT'] == $t_port) {
    $t_servername .= ":$t_port";
}
define('WP_SITEURL', $t_servername);
define('WP_HOME', $t_servername);

4) Now, when I want to perform administrative tasks in WordPress, I just need to remember to access my /wp-admin directory via the https://trog.qgl.org:8958/ vhost.

5) Throw some extra security on this new vhost. I just whapped on a .htaccess in the vhost configuration, but you can do whatever you want – IP restrictions, or whatever.
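
Before going any further it’s worth confirming step 1 actually took. A quick sketch, run via the public (non-privileged) vhost, that just asks PHP which directories the web user can still write to (the paths are assumptions based on a stock WordPress layout):

foreach (array('.', './wp-content', './wp-content/plugins', './wp-content/uploads') as $dir) {
    echo $dir . ': ' . (is_writable($dir) ? 'still writable by the web user' : 'read-only') . "\n";
}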

After doing some basic testing to confirm it was all working as expected, I then went to write this post. I hit ‘save draft’ and was promptly greeted with a bizarre error from my WPSearch plugin (“Fatal error: Call to a member function find() on a non-object in [..]/wp-content/plugins/wpsearch/WPSearch/Drivers/Search/Phplucene.php”). This was mysterious! What had I done wrong?

So I looked through the WPSearch code, trying to figure out what was going on. Eventually I realised – I’d tried writing this post from my non-privileged vhost. WPSearch must need to write to the disk somewhere as the web user – presumably to update the search index – and it was failing with that error because it wasn’t expecting to suddenly no longer be able to write to the disk (presumably when installing WPSearch it tells you if your file permissions are incorrect for usage).

After that I jumped back in to my privileged vhost and rewrote the post – and so far, so good. I’ll test this for a bit longer but to me it seems like an obvious way of running a more secure instance of WordPress, albeit with a bit more messing around.

Important notes:

Any plugin that you’re running that needs to write to the disk as part of its usual operation will probably fail.

WP Super Cache is one that I’m using that will simply not work with this method – cache writes fail silently on the public interface and the cache just won’t function.

To fix this you need to find out what it needs to write to and give it write permission (which somewhat defeats the point of this exercise, but I’d much rather have only the cache directory world-writable) – in this case, ‘chmod o+w ./wp-content/cache’ fixes up WP Super Cache.

I’ll add more as I discover more.

Updated 2011-08-03: Added WP_HOME into step 3; it is required for various reasons – things like WP Super Cache and the permalinks menu break without it.

Updated 2011-08-15: A new problem – adding images into a post while you’re using the ‘admin port’ means that they’ll get referenced with this port. Not sure how to work around that one.

Differences in Requesting gzip’ed Content using curl in PHP

There are some slight differences in the way curl requests are handled when you’re requesting gzip’ed content from a web server. I found this slightly non-obvious at first, even though it’s really pretty clear once you know, so in the interests of clarity I thought I’d write it down.

If you want to use curl to retrieve gzip’ed content from a webserver, you simply do something like this:


$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response rather than printing it
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate");
$data = curl_exec($ch);

What I found weird was that when I did something like ‘strlen($data)’ after that call, the result clearly indicated that the retrieved data was not compressed – strlen() was reporting 100 kbytes, but when I wget’ed the same page gzip’ed, I could see that it was only around 10 kbytes.

I added the header option to the curl request so I could see what was going on, so the code became:


$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate");
curl_setopt($ch, CURLOPT_HEADER, true); // include the response headers in $data
$data = curl_exec($ch);

This yielded something like:

HTTP/1.1 200 OK
Date: Thu, 28 Jul 2011 23:03:42 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Mono
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 11091
Connection: close
Content-Type: text/html; charset=UTF-8

So the web server was clearly returning a compressed document, as the Content-Length matched the ~10 kbyte figure I was seeing with wget, but the actual size of the $data variable was out of whack with this.

As it turns out, CURLOPT_ENCODING also controls whether curl decodes the response from the webserver. So in addition to setting the required Accept-Encoding header on the request, it transparently decompresses the response so you can deal directly with the uncompressed content. Upon reflection, this is a little obvious if you just read the manual page.

Basically, the problem was that I was expecting (and wanting) to get a binary chunk of compressed data. This was not the case, but what curl was doing worked out fine for me anyway.

However, I did figure out how to get the binary chunk that I was initially wanting. Basically instead of using the CURLOPT_ENCODING option, you just add a header to the request and set binary transfer mode on, so the code simply becomes:


$headers[] = "Accept-Encoding: gzip";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$data = curl_exec($ch);

This will return the gzip’ed chunk of binary gibberish to $data (which, of course, will be much smaller when you run strlen() on it).

PHP Compression: gzcompress vs gzdeflate vs gzencode

Really, PHP? You have three different zlib functions for compressing? I’m sure there’s an excellent reason for this, but I’ve barely ever looked at zlib in PHP, so I was a bit surprised at the variety and the subtle differences between them.

I happened to pick gzcompress() initially and struggled a bit trying to figure out what was going on – it seems to produce a consistent two byte header of 78 5e, but that is different to what is mentioned in the magic number listing I found – gzip is listed as 1f 8b 08, which is what you’ll see if you use gzencode(). gzdeflate() doesn’t seem to leave a header at all.

This post on Stack Overflow has a little info about the differences; which one you’ll need depends exactly on what you’re doing.

To make things even more awesome though, just after I decided I wanted to use gzencode(), I discovered that gzdecode() isn’t actually implemented in PHP 5.3 – apparently it is scheduled for PHP 6, so presumably gzencode() is only useful to those who have another mechanism to extract gzip’ed data.
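
A quick sketch showing the difference in practice (nothing here is specific to my data, any string will do):

$raw = str_repeat("hello world ", 1000);

$deflate = gzdeflate($raw, 4);   // raw DEFLATE stream, no header at all
$zlib    = gzcompress($raw, 4);  // zlib wrapper, two byte header starting with 0x78
$gzip    = gzencode($raw, 4);    // gzip wrapper, the familiar 1f 8b 08 magic bytes

echo bin2hex(substr($zlib, 0, 2)) . "\n";  // e.g. "785e"
echo bin2hex(substr($gzip, 0, 3)) . "\n";  // "1f8b08"

// each has a matching decoder; gzdecode() (for gzencode'd data) is the one missing in 5.3
var_dump(gzinflate($deflate) === $raw);    // bool(true)
var_dump(gzuncompress($zlib) === $raw);    // bool(true)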

I did a very quick benchmark with about 30 files totaling around 130MB and got the following results using compression level 4, though I also tested level 9 and there was little difference:

gzdeflate():

real 0m5.562s
user 0m5.436s
sys 0m0.125s

gzencode():

real 0m5.679s
user 0m5.566s
sys 0m0.111s

gzcompress():

real 0m6.011s
user 0m5.878s
sys 0m0.131s
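
For what it’s worth, the shape of the test was roughly this (a sketch only; the file path and timing it with the shell’s time builtin are assumptions, and you swap in gzencode() or gzcompress() for the other runs):

foreach (glob('/path/to/testdata/*') as $file) {
    gzdeflate(file_get_contents($file), 4);   // compress and throw the result away
}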

Securing Apache/PHP in Shared Hosting Environments

Every couple of years I get interested in figuring out how to build a new shared hosting platform, like the one we used to run on AusGamers back in the day – we hosted over a thousand websites for things like gaming clans, hobby sites, and so on. We had a great, simple system that a few people hacked on that automatically provisioned an Apache setup, MySQL databases, and email stuff. It was basic, but it did the job and meant we were able to easily provide a system to host all those sites, which we did on a single box.

This service was completely free and not considered mission critical; security was something we were vaguely concerned about but never really spent a lot of time on. My big concern was the PHP processing side – on most shared hosting platforms, PHP runs in the context of the Apache process, so if you’re not really careful with your permissions you can end up creating issues between your various sites.

So every few years I rack my brain to try to remember what options I looked at last time. This time, I’m going to write them down as I come across them so I can find them again easily when I repeat the process.

The solutions I am most interested in are the ones that let Apache run with only certain permissions based on who owns the files – so you can, for example, have multiple web roots, each owned by a different uid/gid (so sites have their own user account in the host operating system), and PHP’s access is limited to each directory as it executes as that user.
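
Whichever mechanism you end up with, the end result is easy to sanity-check from inside a script. A quick sketch (assuming the posix extension is available) that shows which user PHP is actually running as for a given vhost:

$uid  = posix_geteuid();
$info = posix_getpwuid($uid);
echo "PHP is running as uid $uid ({$info['name']})\n";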

Here’s what I’ve found so far, in rough order of what I’m going to try:

MPM-ITK – non-threaded so more stable but performance hit. Each vhost runs with its own uid/gid. Available in most major distro repositories (including Debian/Ubuntu). Runs as root. Last update: Apr 2009.

suPHP – Apache module + setuid root binary that changes the uid of the process running a PHP script. Last update: Mar 2009.

Peruser MPM – run each Apache process as its own uid/gid. Apparently has better performance under some circumstances, may need to use non-threaded version for better stability. Last update: Oct 2007.

muxmpm/metuxmpm – refers to this page which is a 404, no other readily available information.

Possibly to be used in conjunction with PHP’s open_basedir directive.
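
The effect of open_basedir is easy to demonstrate from a script. A quick sketch, assuming that vhost’s PHP config sets open_basedir to the site’s own docroot:

var_dump(ini_get('open_basedir'));            // shows the restriction in effect for this vhost
var_dump(@file_get_contents('/etc/passwd'));  // bool(false) - outside the allowed path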

Does Debian's Packaged PHP Include Suhosin?

I had noticed several times while running PHP scripts that the default PHP install from the Debian repositories at the moment seems to include Suhosin – running ‘php -v’ yields:

PHP 5.2.6-1+lenny9 with Suhosin-Patch 0.9.6.2 (cli) (built: Aug 4 2010 03:25:57)

I had assumed this meant Suhosin was installed, but I was a bit confused as to why Suhosin functions like sha256() and sha256_file() didn’t exist, and also why the constant CRYPT_BLOWFISH didn’t appear to be set.

After a bit of looking around I finally thought to look at the actual Debian package page, which indicates there are actually two parts to Suhosin:

The first part is a small patch against the PHP core, that implements a few low-level protections against bufferoverflows or format string vulnerabilities and the second part is a powerful PHP extension that implements all the other protections.

So I assume the one that comes with PHP when installed via the usual apt-get method is the first part, and if you want the fully-fledged Suhosin extension you’ll need to figure out how to install the other one.
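
A quick way to tell which one you’ve actually got from inside PHP (just a sketch): the extension registers itself with the engine, while the core patch only changes internals, so these two checks tell them apart:

var_dump(extension_loaded('suhosin'));   // true only if the full Suhosin extension is installed
var_dump(function_exists('sha256'));     // sha256()/sha256_file() come from the extension too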