Image Data Only Hashing of JPEG Files

As part of a small project to verify backups, I came across a case where I had two photos that looked identical but with different EXIF data.

The backup verification system (correctly) flagged these as two different files – as the SHA1 file hashes were different. However, the actual photo was – as far as I could tell – absolutely identical, so I started looking to see if there was a way to verify JPEG files based on the image data alone (instead of the entire file, which would include meta stuff like the EXIF data).

A quick look around revealed that ImageMagick has a “signature hash” function as part of ‘identify’, which sounded perfect. You can test it like so:

identify.exe -verbose -format "%#" test.jpg

At first glance this solved the problem, but testing on a few systems showed that I was getting different hashes for the same file – it looked like different versions of ImageMagick return a different hash. I’ve asked about this on their forum and was told that the signature algorithm has changed a few times – which makes it sort of useless if compatibility across platforms is required.

After looking around a bit more for an alternative, I found the (possibly Australian made?) PHP JPEG Metadata Toolkit, which (amongst many other things) includes a get_jpeg_image_data() function that (so far) seems to work reliably across systems. Pulling the data out and running it through SHA1 gives a simple, usable way to hash the image-only data in a JPEG file.
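For illustration, here is a minimal Python sketch of the same idea. It is not the Toolkit's get_jpeg_image_data() code and may not select exactly the same bytes; it is one possible approach that assumes a well-formed JPEG, skips the metadata segments (APP0 to APP15 and comments) and hashes everything else. The function name jpeg_image_hash is made up for this example.

```python
import hashlib
import struct


def jpeg_image_hash(data: bytes) -> str:
    """SHA1 of a JPEG's non-metadata segments plus its compressed scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    sha1 = hashlib.sha1()
    pos = 2  # first segment starts right after the SOI marker
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of Scan: everything from here on is image data.
            sha1.update(data[pos:])
            return sha1.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment_end = pos + 2 + length
        # Assumption: APPn (0xE0-0xEF) and COM (0xFE) hold only metadata.
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            sha1.update(data[pos:segment_end])
        pos = segment_end
    raise ValueError("no Start of Scan marker found")
```

Two files that differ only in their EXIF block should produce the same digest with this sketch. Anything a camera appends after the end-of-image marker is still hashed, so treat it as a starting point rather than a finished tool.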

Location-based Advertising Goes Wrong; Clues about Dodgy Advertising

If you’re an astute observer, you might have noticed that some elements on overseas web sites – advertising, for example, or some other content – sometimes refer to the city you’re living in.

It might seem like an astonishing coincidence that an article on the Toronto Times or the South Xihuan Observer just happens to have something like this on their website at the exact same time you just happened to click through from Google… but it isn’t. It is the result of location-based advertising – detecting some information about you from your web browser and figuring out where you are. Usually this is done by your IP address and it is a simple look-up in some database that maintains a list of how geographical locations map to certain IP ranges (colloquially referred to as “GeoIP”).
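As a rough sketch of what such a look-up involves, the Python below maps an IP address to a region using a tiny hand-made table. The table, the region names and the function name lookup_region are all hypothetical (the ranges are reserved documentation addresses); a real GeoIP database is far larger and is usually queried through a vendor library.

```python
import ipaddress

# Hypothetical range-to-region table, using reserved documentation ranges.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "Brisbane"),
    (ipaddress.ip_network("198.51.100.0/24"), "Toronto"),
]


def lookup_region(ip, table=GEO_TABLE):
    """Return the region whose range contains ip, or None if nothing matches."""
    address = ipaddress.ip_address(ip)
    for network, region in table:
        if address in network:
            return region
    return None


print(lookup_region("192.0.2.55"))
print(lookup_region("203.0.113.9"))
```

The second call returns None because the address is not in the table, which is the case an ad template has to handle gracefully.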

This is not an exact science, and as this screengrab shows, sometimes things can go wrong:

This is probably just a simple programming error – the “REGION” tag should have been replaced with my actual region.

This is mostly a fascinatingly boring example of a web site bug.

The only interesting thing is that it clearly highlights that the module with that error is engaging in deception to try to trick you into clicking on it. Clearly, this is not a “new trick in your region” – it is some bullshit generic factoid, presumably about car insurance, that they’re trying to bait you into clicking by implying that it is related to where you live.

There are, of course, other location-based clues in this (rather poor) ad – it has what is pretty clearly a US police department patrol car, and the text of the ad refers to “miles per day” – so hopefully even the casual Australian Internet user would start hearing alarm bells.

While it almost certainly isn’t a scam and probably poses no real “danger”, it’s important for people to be alert for little tricks like this that attempt to change your behaviour by appealing to you by “hitting you at home”, so to speak.

Sogou Search Engine Spider Smashing Websites

I was keeping an eye on CPU usage on a newly provisioned VPS, to which part of AusGamers was recently transferred, and noticed a big, unusual spike:

Correlating this with another graph indicated it was something hitting our news or forum pages pretty hard, so I nabbed the Apache logs and quickly determined what it was – the “Sogou web spider”, hitting our front page twice a second, over and over again:

- - [13/Sep/2011:10:52:16 +1000] "GET / HTTP/1.0" 301 233 "" "Sogou web spider/4.0(+http://"
- - [13/Sep/2011:10:52:16 +1000] "GET / HTTP/1.0" 301 233 "" "Sogou web spider/4.0(+"
- - [13/Sep/2011:10:52:17 +1000] "GET / HTTP/1.0" 301 233 "" "Sogou web spider/4.0(+"
- - [13/Sep/2011:10:52:17 +1000] "GET / HTTP/1.0" 301 233 "" "Sogou web spider/4.0(+"
- - [13/Sep/2011:10:52:17 +1000] "GET / HTTP/1.0" 301 233 "" "Sogou web spider/4.0(+"
- - [13/Sep/2011:10:52:18 +1000] "GET / HTTP/1.0" 301 233 "" "Sogou web spider/4.0(+"

… and so on, for a total of 18,763 requests. Eventually it moved on to other pages, but I stopped counting.
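Counting requests like this by hand gets old quickly. The Python sketch below tallies requests per client and user agent, assuming the logs are in Apache's standard “combined” format. The function name busiest_clients is made up for this example, and the sample lines in the usage note are invented, not taken from our real logs.

```python
import re
from collections import Counter

# One line of Apache "combined" format:
# host ident user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def busiest_clients(lines, top=3):
    """Return the most frequent (host, user agent) pairs with their counts."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that are not in combined format
            counts[(match["host"], match["agent"])] += 1
    return counts.most_common(top)
```

For example, with three invented log lines:

```python
sample = [
    '192.0.2.10 - - [13/Sep/2011:10:52:16 +1000] "GET / HTTP/1.0" 301 233 "-" "ExampleSpider/1.0"',
    '192.0.2.10 - - [13/Sep/2011:10:52:16 +1000] "GET / HTTP/1.0" 301 233 "-" "ExampleSpider/1.0"',
    '192.0.2.20 - - [13/Sep/2011:10:52:17 +1000] "GET /news HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
print(busiest_clients(sample))
```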

The URL in our logs directs you to a Chinese language FAQ which, when run through the awesome translate feature in Chrome, directs you to a form where you can submit a complaint about “crawling too fast”. I did that (in English) and will be fascinated to see if I get a response.

In the meantime, we just blocked the IP address.

Fixing Double Encoded Characters in MySQL

If you’re working on any old PHP/MySQL sites, chances are at some point you’re going to need to get into the murky, painful world of character encoding – presumably to convert everything to UTF-8 from whatever original setup you have. It is not fun, but fortunately many people have gone through it before and paved the way with a collection of useful information and scripts.

One problem which struck us recently when migrating our database server was certain characters being “double encoded”. This appears to be relatively common. For us, the cause was exporting our data – all UTF-8 data but stored in tables that were latin1 – via mysqldump and then importing again as if it was UTF-8. Roughly speaking, each byte of a multibyte UTF-8 character gets treated as a separate latin1 character and re-encoded as UTF-8 – so you end up with these double encoded characters that look like squiggly gibberish appearing in all your web pages.
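The effect is easy to reproduce outside MySQL. The Python sketch below simulates it under one assumption: that “latin1” behaves like Windows-1252 (cp1252), which is roughly how MySQL treats it. The function name double_encode is made up for this example.

```python
def double_encode(text: str) -> str:
    """Simulate the bug: UTF-8 bytes misread as cp1252 characters."""
    utf8_bytes = text.encode("utf-8")   # what was really stored
    return utf8_bytes.decode("cp1252")  # what you get when it is misread


broken = double_encode("café")
print(broken)                           # the squiggly gibberish version
print(len("café".encode("utf-8")))      # size of the original in bytes
print(len(broken.encode("utf-8")))      # size once stored as UTF-8 again
```

The single “é” becomes two characters, and saving that result as UTF-8 makes the stored value larger than the original, which is one way to spot affected rows.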

Nathan over at the Blue Box Group has written an extremely comprehensive guide to problems like this. It explains the root cause of these problems, the common symptoms, and – of course, most importantly – precise details on how to safely fix them. If you’re doing anything at all involved in changing character encoding then it is worth a read even before you have problems, just so you can get a better handle on how to fix things and what your end game should be.

There are a few other ways to fix it, of course. The Blue Box solution is comprehensive and reliable but it requires quite a bit of work to get it going, and you also need to know which database table fields you want to work on specifically – so it can be time consuming unless you’re prepared to really sit down and work on it, either to process everything manually or write a script to do it all for you.

Fortunately there’s an easier way, as described here – basically, all you need to do is export your current dataset with mysqldump, forcing it to latin1, and then re-import it as UTF-8:

mysqldump -h DB_HOST -u DB_USER -p --opt --quote-names --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

mysql -h DB_HOST -u DB_USER -p --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

We did this and it worked perfectly – the only caveat you need to be aware of is that it will mess up UTF-8 characters that are already properly encoded. For us this wasn’t a big deal as we were able to clearly identify them and fix them manually.
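If you end up repairing individual values in a script rather than with a dump and re-import, that caveat can be softened by only accepting a repair when it round-trips cleanly. This is a hedged Python sketch, not what we actually ran: it assumes the damage was a UTF-8 to cp1252 misread, and the function name repair_double_encoding is made up.

```python
def repair_double_encoding(text: str) -> str:
    """Undo a UTF-8/cp1252 double encoding, leaving other text untouched."""
    try:
        # Reverse the misread: back to the original bytes, then decode properly.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Not representable in cp1252, or not valid UTF-8 afterwards:
        # assume the text was correctly encoded all along.
        return text
```

A correctly stored “café” fails the reverse step and comes back unchanged, while its double encoded form is restored. It is a heuristic: text that legitimately contains sequences like “Ã©” would be altered, so check the results before writing them back.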

StackOverflow has yet another approach which might be suitable if you’re dealing with only one or two tables and just want to fix it from the MySQL console or phpMyAdmin or whatever – changing the table character sets on the fly:

ALTER TABLE [tableName] MODIFY [columnName] [columnType] CHARACTER SET latin1;
ALTER TABLE [tableName] MODIFY [columnName] [columnType] CHARACTER SET binary;
ALTER TABLE [tableName] MODIFY [columnName] [columnType] CHARACTER SET utf8;

This method worked fine for me in a test capacity on a single table but we didn’t end up using it everywhere.

Setting Up Infobox Templates in MediaWiki

Note: This guide has been updated as of 2014-09-22 for MediaWiki v1.23. If you’re using this version (or later) please see the Infoboxes in MediaWiki v1.23 post.

If you’ve ever been to any of the more structured Wikipedia pages you have probably seen the neat “infoboxes” that they have on the right hand side. They’re a convenient way to get some of the core metainfo from an article.

If you have your own MediaWiki instance, you’ve probably thought they’d be a nice thing to have, so maybe you copied and pasted the code from Wikipedia and then were surprised when it didn’t just magically work. Turns out that the infobox stuff is part of MediaWiki’s extensive Templating system, so first of all you need the templates. Sounds easy, right?

Well, no. You don’t just flip a switch or download a file, and when you do a search you might find this article which details a process that it says might take 60-90 minutes.

I started looking into it and quickly got lost; you basically need to create a billion different Templates and do all sorts of weird stuff to get it to work. Fortunately I stumbled across this discussion which contained a clue that greatly simplifies the process.

I was able to distill the steps down to a process that I was able to reproduce on a new MediaWiki install in about five minutes. Before we start, I’ll throw in the warning that I have not read the documentation and I don’t understand at a low level what is happening with the templating. I just wanted a working, simple infobox.

  1. Download the MediaWiki extension ParserFunctions and add it to your LocalSettings.php as referred to there.
  2. Copy the CSS required to support the infobox from Wikipedia to your Wiki. The CSS is available in Common.css. You’ll probably need to create the stylesheet – it will be at http://your_wiki/wiki/index.php?title=MediaWiki:Common.css&action=edit – and then you can just copy/paste the contents in there. (I copied the whole file; you can probably just copy the infobox parts.)
  3. Export the infobox Template from Wikipedia:
    1. Go to Wikipedia’s Special:Export page
    2. Leave the field for ‘Add pages from category’ empty
    3. In the big text area field, just put in “Template:Infobox”.
    4. Make sure the three options – “Include only the current revision, not the full history”, “Include templates”, and “Save as file” – are all checked
    5. Hit the ‘Export’ button; it will think for a second then spit out an XML file containing all the Wikipedia Templates for the infobox for you to save to your PC.
  4. Now that you have the Templates, you need to import them into your MediaWiki instance. Simply go to your Import page – http://your_wiki/wiki/index.php/Special:Import – select the file and then hit ‘Upload file’. NOTE: see update at the bottom of the page before doing this.
  5. With the Templates and styles added you should now be able to add a simple infobox. Pick a page and add something like this to the top:
    {{Infobox
    |title = Infobox Title
    |header1 = Infobox Header
    |label2 = Created by
    |data2 = David
    |label3 = External reference
    |data3 = []
    }}

The full infobox Template docs are available here – there’s a lot of stuff in there, but if you just want a really basic infobox then this is the simplest way I found to get them working.

I tested this on two separate MediaWiki installs – one running v1.12.1 and one on v1.15.1 – and it worked on both of them, but as always YMMV.

Update 2013-07-27

As many people have noticed, the guide no longer works. Thanks to commenters jh and chojin, it looks like you also need to do the following:

  • Install the Scribunto extension and add it to your LocalSettings.php as usual. It looks like this extension is now required for the InfoBox templates (in fact, it looks like it replaces ParserFunctions entirely, but I’m still testing that).
  • The XML file that is output in step 3 appears to erroneously (?) use text/plain as the format type. If you edit this XML file in your text editor and replace all instances of ‘text/plain’ with ‘CONTENT_FORMAT_TEXT’ (I only found two), the import will be successful and the infobox tags look like they work.
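If you would rather script that edit than do it by hand, something like the Python below should do. It is only a sketch: it assumes the value sits inside <format> elements in the export file (worth checking in your own file), and the function name fix_export_format is made up for this example.

```python
def fix_export_format(xml_text: str):
    """Swap the format value in <format> elements; return (new_text, count)."""
    old = "<format>text/plain</format>"
    new = "<format>CONTENT_FORMAT_TEXT</format>"
    # Only touch the element itself, so page text that happens to
    # mention text/plain is left alone.
    return xml_text.replace(old, new), xml_text.count(old)
```

Read the exported file, pass its contents through the function, and write the result to a new file before importing it. The returned count tells you how many replacements were made, so you can compare it with what you expected to find.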

If someone else can confirm this for me as a working solution I’ll revise the original post so it takes these steps into account.

Should I Gzip Content Before Putting it in MySQL?

The answer for us was “yes”, although there’s a lot more to it than that. I just wrote about doing this on AusGamers for a table that was causing us a lot of grief with really slow DELETEs due to the huge volume of data in there.

I found that gzip’ing the content before putting it into the database made a massive difference to performance – queries that would usually take minutes to run because they were removing up to gigabytes of data suddenly were dealing with 10x fewer bytes, which made a huge impact on the execution time.

The results were obvious – you can see in the graphs below the impact that was made.

This change might not be useful in all circumstances – obviously at some point the CPU overhead of gzip’ing might cause more problems than it’s worth, or something. But if you’re dealing with multi-megabyte chunks of text that MySQL only needs to pull in and out (i.e., you don’t need to sort by the contents or do anything else with that data from within MySQL), it’s probably worth trying.
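The compression itself happens in the application, not in MySQL. Here is a minimal Python sketch of the idea (not the code we run on AusGamers; the function names pack and unpack are made up). The compressed result is binary, so it would need to go into a BLOB-type column rather than a TEXT one.

```python
import gzip


def pack(text: str) -> bytes:
    """Compress text into bytes suitable for a binary (BLOB) column."""
    return gzip.compress(text.encode("utf-8"))


def unpack(blob: bytes) -> str:
    """Reverse pack(): decompress and decode back to text."""
    return gzip.decompress(blob).decode("utf-8")


article = "Some long, fairly repetitive article text. " * 2000
blob = pack(article)
print(len(article.encode("utf-8")), "bytes before,", len(blob), "bytes after")
```

How much you save depends entirely on the content: repetitive text and markup shrink a lot, while data that is already compressed (images, archives) barely shrinks at all.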

Securing WordPress Using a Separate, Privileged Apache Vhost

Something I’ve been meaning to check out for a while – locking down WordPress to make it really secure. It’s always freaked me out a bit having web server-writable directories, but it just makes WordPress so powerful and, frankly, easy to use.

I checked out the hardening guide on the official WordPress site. It has a bunch of tips about how to set file system permissions, but at the end of the day you basically need to keep certain directories world-writable if you want to have that handy functionality that lets you do things like install plugins, edit themes, and automatically update.

However, reading about a new zero-day exploit in a particular file that is packaged with many WordPress themes (not one that I happened to have installed) drove me to action, along with the realisation that basically none of those simple hardening things is going to be useful if your site is set up with web-writable directories. If there’s an exploit in code – whether it’s core WP code or some random thing you’ve added in a plugin or theme – chances are you’ll be vulnerable.

So I have decided to try something else.

1) I’ve chowned all the files in my WordPress directory to be owned by a non-web user, but left o+rx, which means the web process can happily read everything and serve my files – but it can no longer write to the directory. This of course means all that functionality I mentioned above no longer works.

2) I’ve created a new Apache vhost on my VPS on a separate port. As I am running MPM-ITK – a module for Apache that allows me to specify what uid/gid the Apache process will run as on a per-vhost basis – I can tell this vhost to run as the same non-web user that owns all the files.

3) I’ve made a tiny change to my wp-config.php file so that it lets me access this WordPress instance on the vhost without rewriting the URLs and forwarding me back to the main vhost. I just did something like this:

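// Port that the privileged admin vhost listens on.
$t_port = 8958;
$t_servername = '';
// On the admin port, add the port to the site URL so WordPress doesn't redirect to the main vhost.
if ($_SERVER['SERVER_PORT'] == $t_port) {
    $t_servername .= ":$t_port";
}
define('WP_SITEURL', $t_servername);
define('WP_HOME', $t_servername);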

4) Now, when I want to perform administrative tasks in WordPress, I just need to remember to access my /wp-admin directory via the vhost.

5) Throw some extra security on this new vhost. I just whapped on a .htaccess in the vhost configuration, but you can do whatever you want – IP restrictions, or whatever.

After doing some basic testing to confirm it was all working as expected, I then went to write this post. I hit ‘save draft’ and was promptly greeted with a bizarre error from my WPSearch plugin (“Fatal error: Call to a member function find() on a non-object in [..]/wp-content/plugins/wpsearch/WPSearch/Drivers/Search/Phplucene.php”). This was mysterious! What had I done wrong?

So I looked through the WPSearch code, trying to figure out what was going on. Eventually I realised – I’d tried writing this post from my non-privileged vhost. WPSearch must need to write to the disk somewhere as the web user – presumably to update the search index – and it was failing with that error because it wasn’t expecting to suddenly lose the ability to write to the disk (presumably when installing WPSearch it tells you if your file permissions are incorrect for usage).

After that I jumped back in to my privileged vhost and rewrote the post – and so far, so good. I’ll test this for a bit longer but to me it seems like an obvious way of running a more secure instance of WordPress, albeit with a bit more messing around.

Important notes:

Any plugin that you’re running that needs to interact by writing to the disk as part of its usual process will probably fail.

WP Super Cache is one that I’m using that will simply not work with this method – cache requests fail silently from the public interface and the cache simply will not function.

To fix this you need to find out what it needs to write to and give it full permission (which somewhat obviates the point of this exercise, but I’d much rather have only the cache directory world-writable) – in this case, ‘chmod o+w ./wp-content/cache’ fixes up WP Super Cache.
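Once you start handing out exceptions like that, it is worth checking now and then that nothing else has become world-writable. The Python sketch below walks a directory tree and lists anything with the “other write” bit set. The function name world_writable is made up for this example, and it only looks at permission bits, not at ownership or ACLs.

```python
import os
import stat


def world_writable(root):
    """Return sorted paths (relative to root) that anyone can write to."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                continue  # a symlink's own mode bits are not meaningful
            if mode & stat.S_IWOTH:
                found.append(os.path.relpath(path, root))
    return sorted(found)
```

Run against the WordPress directory, the hope is that it reports only the cache directory you opened up on purpose.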

I’ll add more as I discover more.

Updated 2011-08-03: Added WP_HOME into step 3; it is required for various reasons – things like WP Super Cache and the permalinks menu break without it.

Updated 2011-08-15: A new problem – adding images into a post while you’re using the ‘admin port’ means that they’ll get referenced with this port. Not sure how to work around that one.

Uploading Facebook Photos to a Page Via Email

Every time I want to do this I forget where to go to find the email address.

I am looking after several Facebook Pages for work – the capital-P “Page” that you create for a company. I’m often out-and-about and sometimes want to be able to snap a photo on my mobile phone and then add it to Facebook – but the default sharing stuff built in to Android is only hooked in to my personal Facebook account, and there’s no provision to do things on Pages for which I am an administrator.

I don’t want to log into the mobile version of the site and use it that way either, because it’s simply a pain to do on mobile. Fortunately, Facebook have you covered – you can simply take a photo, ‘share’ it via email, and mail it to a private mailbox for each of your pages.

The email address is never where I think it is though – I often click on the ‘Photos’ link on the left side of the Page profile and then go through the upload process looking for it, but it isn’t available there.

There are two places that I can find it listed – neither requires you to “Use Facebook as Page” for the Page, but I usually do it anyway.

1) Click ‘edit info’ at the top of the Page to go into profile editing mode. Select ‘Mobile’ from the new left side menu and look for the ‘With Mobile Email’ section.

2) At the top of your Page, look for the usual Share module (where you’d go to post a status update) and hit the ‘Photo’ link. Select “Upload a photo from your drive” (even though that’s not what you’re doing) and just below the “browse” option that appears you’ll see a new link called “upload via email” which, when clicked, will present your upload email address.

You should keep this email address private – failure to do so will mean anyone with it can add photos to your Facebook Page. Fortunately Facebook thought of that already and has an option that lets you reset the email address if it is compromised.

Fixing ICS Timezones for Outlook

If you’ve ever used a web application that generates a .ics calendar file for you to import into Outlook, you might have run into problems with timezones. I’ve gone through the same thing the last couple of years with the GDC schedule builder webapp, which is really handy but insists on sending me an .ics file with the local US/Pacific times. This means that when I import it they all end up wrong (eg, a 10am lecture is stored in my calendar as 10am Brisbane time, so when I shift timezones all the meetings end up getting moved and I get appointment alarms going off in the middle of the night, which is annoying when you’re jetlagged).

The only way to easily fix it that I could find is to manually open each calendar item after importing and adjust the timezone. This is time consuming and boring.

I had a quick look around and couldn’t find something to fix this easily, so I wrote a very simple .ics timezone fixer which reads in a pasted .ics file and will convert the timestamps in it to UTC based on a couple of simple rules, so that when you import them into your calendar the times are stored correctly.
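The core of the conversion is small. This Python sketch is not the fixer's actual code, just an illustration of the approach under some assumptions: it only rewrites DTSTART and DTEND lines that carry a local date-time, and it takes the source timezone as a parameter. The function name ics_to_utc is made up for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# DTSTART/DTEND with a local date-time (no trailing Z), with or without
# parameters such as ;TZID=...  All-day dates and UTC stamps do not match.
STAMP = re.compile(
    r"^(DTSTART|DTEND)(?:;[^:\r\n]*)?:(\d{8}T\d{6})(?=\r?$)", re.MULTILINE
)


def ics_to_utc(ics_text, source_tz):
    """Rewrite local DTSTART/DTEND values as UTC timestamps."""
    def convert(match):
        local = datetime.strptime(match.group(2), "%Y%m%dT%H%M%S")
        utc = local.replace(tzinfo=source_tz).astimezone(timezone.utc)
        return f"{match.group(1)}:{utc.strftime('%Y%m%dT%H%M%SZ')}"
    return STAMP.sub(convert, ics_text)


# Assumption for the example: the event times are US Pacific standard time.
pacific_standard = timezone(timedelta(hours=-8))
event = "BEGIN:VEVENT\r\nDTSTART:20120305T100000\r\nEND:VEVENT\r\n"
print(ics_to_utc(event, pacific_standard))
```

A fixed offset like this ignores daylight saving. On Python 3.9 or later, passing zoneinfo.ZoneInfo("America/Los_Angeles") instead should handle that, provided the timezone database is available on the system.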

Securing Apache/PHP in Shared Hosting Environments

Every couple of years I get interested in figuring out how to build a new shared hosting platform, like the one we used to run on AusGamers back in the day – we hosted over a thousand websites for things like gaming clans, hobby sites, and so on. We had a great, simple system that a few people hacked on that automatically provisioned an Apache setup, MySQL databases, and email stuff. It was basic, but it did the job and meant we were able to easily provide a system to host all those sites, which we did on a single box.

This service was completely free and not considered mission critical; security was something we were vaguely concerned about but never really spent a lot of time on. My big concern was the PHP processing side – on most shared hosting platforms, PHP runs in the context of the Apache process, so if you’re not really careful with your permissions you can end up creating issues between your various sites.

So every few years I rack my brain to try to remember what options I looked at last time. This time, I’m going to write them down as I come across them so I can find them again easily when I repeat the process.

The solutions I am most interested in are the ones that let Apache run with only certain permissions based on who owns the files – so you can, for example, have multiple web roots, each owned by a different uid/gid (so sites have their own user account in the host operating system), and PHP’s access is limited to each directory as it executes as that user.

Here’s what I’ve found so far, in rough order of what I’m going to try:

MPM-ITK – non-threaded so more stable but performance hit. Each vhost runs with its own uid/gid. Available in most major distro repositories (including Debian/Ubuntu). Runs as root. Last update: Apr 2009.

suPHP – Apache module + setuid root binary that changes the uid of the process running a PHP script. Last update: Mar 2009.

Peruser MPM – run each Apache process as its own uid/gid. Apparently has better performance under some circumstances, may need to use non-threaded version for better stability. Last update: Oct 2007.

muxmpm/metuxmpm – refers to this page which is a 404, no other readily available information.

Possibly to be used in conjunction with PHP’s open_basedir directive.