Sogou Search Engine Spider Smashing Websites
Sep 13th, 2011Was keeping an eye on our CPU usage on a newly provisions VPS on which a part of AusGamers was recently transferred to and noticed a big, unusual spike in CPU usage:

Correlating this with another graph indicated it was something hitting our news or forum pages pretty hard, so I nabbed the Apache logs and quickly determined what it was – the “Sogou web spider”, hitting our front page twice a second, over and over again:
199.119.201.102 – - [13/Sep/2011:10:52:16 +1000] “GET / HTTP/1.0″ 301 233 “http://www.ausgamers.com” “Sogou web spider/4.0(+h
ttp://www.sogou.com/docs/help/webmasters.htm#07)”
199.119.201.102 – - [13/Sep/2011:10:52:16 +1000] “GET / HTTP/1.0″ 301 233 “http://www.ausgamers.com” “Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)”
199.119.201.102 – - [13/Sep/2011:10:52:17 +1000] “GET / HTTP/1.0″ 301 233 “http://www.ausgamers.com” “Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)”
199.119.201.102 – - [13/Sep/2011:10:52:17 +1000] “GET / HTTP/1.0″ 301 233 “http://www.ausgamers.com” “Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)”
199.119.201.102 – - [13/Sep/2011:10:52:17 +1000] “GET / HTTP/1.0″ 301 233 “http://www.ausgamers.com” “Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)”
199.119.201.102 – - [13/Sep/2011:10:52:18 +1000] “GET / HTTP/1.0″ 301 233 “http://www.ausgamers.com” “Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)”
… and so on, for a total of 18,763 requests Eventually it moved on to our different pages, but I stopped counting.
The URL in our logs directs you to a Chinese language FAQ, which when run through the awesome translate feature in Chrome directs you to a form for which you can submit a complaint about “crawling too fast”. I did that (in English) and will be fascinated to see if I get a response.
In the meantime, we just blocked the IP address.
Comments: 1 --
Fixing Double Encoded Characters in MySQL
Sep 6th, 2011If you’re working on any old PHP/MySQL sites, chances are at some point you’re going to need to get into the murky, painful world of character encoding – presumably to convert everything to UTF-8 from whatever original setup you have. It is not fun, but fortunately many people have gone through it before and paved the way with a collection of useful information and scripts.
One problem which struck us recently when migrating our database server was certain characters being “double encoded”. This appears to be relatively common. For us, the cause was exporting our data – all UTF-8 data but stored in tables that were latin1 – via mysqldump and then importing again as if it was UTF-8. This means something like the characters are detected as multibyte, but because the source and destinations were different, they’re re-encoded – so you end up with these double encoded characters that look like squiggly gibberish appearing in all your web pages.
Nathan over at the Blue Box Group has written an extremely comprehensive guide to problems like this. It explains the root cause of these problems, the common symptoms, and – of course, most importantly – precise details on how to safely fix them. If you’re doing anything at all involved in changing character encoding then it is worth a read even before you have problems, just so you can get a better handle on how to fix things and what your end game should be.
There’s a few other ways to fix it, of course. The Blue Box solution is comprehensive and reliable but it requires quite a bit of work to get it going, and you also need to know which database table fields you want to work on specifically – so it can be time consuming unless you’re prepared to really sit down and work on it, either to process everything manually or write a script to do it all for you.
Fortunately there’s an easier way, as described here – basically, all you need to do is export your current dataset with mysqldump, forcing it to latin1, and then re-import it as UTF-8:
mysqldump -h DB_HOST -u DB_USER -p –opt –quote-names –skip-set-charset –default-character-set=latin1 DB_NAME > DB_NAME-dump.sqlmysql -h DB_HOST -u DB_USER -p –default-character-set=utf8 DB_NAME < DB_NAME-dump.sql
We did this for AusGamers.com and it worked perfectly – the only caveat you need to be aware of is that it will mess up UTF-8 characters that are properly encoded aleady. For us this wasn’t a big deal as we were able to clearly identify them and fix them manually.
StackOverflow has yet another approach which might be suitable if you’re dealing with only one or two tables and just want to fix it from the MySQL console or phpMyAdmin or whatever – changing the table character sets on the fly:
ALTER TABLE [tableName] MODIFY [columnName] [columnType] CHARACTER SET latin1
ALTER TABLE MyTable [tableName] [columnName] [columnType] CHARACTER SET binary
ALTER TABLE MyTable [tableName] [columnName] [columnType] CHARACTER SET utf8
This method worked fine for me in a test capacity on a single table but we didn’t end up using it everywhere.
Comments: 0 --
Trials and Tribulations of Updating PGP Desktop
Aug 31st, 2011I somehow missed the news in April last year that Symantec would be acquiring PGP. Symantec doesn’t exactly have a stellar reputation amongst technical people (my Dell laptop still has some mystical, seemingly uninstallable software components from a Symantec product that was on there when I bought it that I could never get rid of), so I’m sure if I had known about it, it would have filled me with dread.
I found out about it today when I loaded PGP Desktop and realised I hadn’t checked for updates for a while. Normally I haven’t needed to – PGP were pretty good about emailing me about updates. So I opened the application and hit Help->Update. After a split second of thinking, I’m greeted with a dialog telling me: “Product manifest from the PGP Corporation update server fails the integrity check. Please try again later.” I tried again later, same thing, so I did the next step anyone would try when troubleshooting and Googled the error message.
I was directed to this thread on the Symantec forums (never a good sign when the first hits aren’t in some support knowledge base). Fortunately, it had a reply from a Symantec tech support person, so that was good news.
The reply advised users experiencing the problem to download this PDF. Another bad sign. Why isn’t this just linked on a website? Load the PDF and you’re greeted with something that looks like this:

Really? You can’t even get the slashes the right way around in your hyperlinks? Dread level increasing.
Anyway, I tried the process. Went to the URL in point 1 and was told I need to sign up for an account. No worries, makes sense after reading the rest of the document – you get access to a license management section in the Symantec website, so an account seems like a reasonable thing. A relatively painless process; didn’t even need to activate. Tried to log in – more dread:

Augh. Stuck.
I realise that Symantec probably have a bit of work to do as part of the changeover – they say as much in the forum post. But getting software updates seems like enough of a Big Deal to warrant a bit more effort – not to say attention to detail – if they expect corporate customers to want to keep coming back. If I wanted to go to all this effort with desktop encryption software and keeping it up to date, I’d be using GPG.
Comments: 0 --
Setting Up Infobox Templates in MediaWiki
Aug 15th, 2011If you’ve ever been to any of the more structured Wikipedia pages you probably have seen the neat “infoboxes” that they have on the right hand side. They’re a neat, convenient way to get some of the core metainfo from an article.
If you have your own MediaWiki instance, you’ve probably thought they’d be a nice thing to have, so maybe you copy and pasted the code from Wikipedia and then were surprised when it didn’t just magically work. Turns out that the infobox stuff is part of MediaWiki’s extensive Templating system, so first of all you need the templates. Sounds easy, right?
Well, no. You don’t just flip a switch or download a file, and when you do a search you might find this article which details a process that it says might take 60-90 minutes.
I started looking into it and quickly got lost; you basically need to create a billion different Templates and do all sorts of weird stuff to get it to work. Fortunately I stumbled across this discussion which contained a clue that greatly simplifies the process.
I was able to distill the steps down to a process that I was able to reproduce on a new MediaWiki install in about five minutes. Before we start, I’ll throw in the warning that I have not read the documentation and I don’t understand at a low level what is happening with the templating. I just wanted a working, simple infobox.
- Download the MediaWiki extension ParserFunctions and add it to your LocalSettings.php as referred to there.
- Copy the CSS required to support the infobox from Wikipedia to your Wiki. The CSS is available in Common.css. You’ll probably need to create the stylesheet – it will be at http://your_wiki/wiki/index.php?title=MediaWiki:Common.css&action=edit – and then you can just copy/paste the contents in there. (I copied the whole file; you can probably just copy the infobox parts.)
- Export the infobox Template from Wikipedia:
- Go to Wikipedia’s Special:Export page
- Leave the field for ‘Add pages from category’ empty
- In the big text area field, just put in “Template:Infobox”.
- Make sure the three options – “Include only the current revision, not the full history”, “Include templates”, and “Save as file” – are all checked
- Hit the ‘Export’ button; it will think for a second then spit out an XML file containing all the Wikipedia Templates for the inbobox for you to save to your PC.
- Now you have the Template, you need to integrate them into your MediaWiki instance. Simply go to your Import page – http://your_wiki/wiki/index.php/Special:Import – select the file and then hit ‘Upload file’.
- With the Templates and styles added you should be able to now add a simple infobox. Pick a page and add something like this to the top:
{{Infobox
|title = Infobox Title
|header1 = Infobox Header
|label2 = Created by
|data2 = David
|label3 = External reference
|data3 = [http://trog.qgl.org trog.qgl.org]
}}
The full infobox Template docs are available here – there’s a lot of stuff in there, but if you just want a really basic infobox then this is the simplest way I found to get them working.
I tested this on two separate MediaWiki installs – one running v1.12.1 and one on v1.15.1 – and it worked on both of them, but as always YMMV.
Comments: 1 --
WackGet v1.2.4 – Now Works on Windows 7 (and Vista)!
Aug 4th, 2011After a far too long delay, we’ve finally got a new version of WackGet which works on Windows 7 and Windows Vista!
Big thanks to Andrew at Mammoth Media for throwing his time into getting this updated and working. It turns out there were a couple of bugs that were present in previous versions that just happened to work on previous Windows versions, but the more recent Windows editions wouldn’t put up with that sort of crap and barfed.
Here’s the change log:
1.2.4 – Bugfixes
Corrected a bug related to invalid memory access that caused a crash on Windows Vista/7
Corrected a bug where a download may appear stuck for 10-15 seconds before starting
You can download v1.2.4 here. We’re just packaging up the source code and will have it available shortly.
Comments: 1 --
Should I Gzip Content Before Putting it in MySQL?
Aug 3rd, 2011The answer for us was “yes”, although there’s a lot more to it than that. I just wrote about doing this on AusGamers for a table that was causing us a lot of grief with really slow DELETEs due to the huge volume of data in there.
I found that gzip’ing the content before putting it into the database made a massive difference to performance – queries that would usually take minutes to run because they were removing up to gigabytes of data suddenly were dealing with 10x less bytes, which made a huge impact to the execution time.
The results were obvious – you can see in the graphs below the impact that was made.
This change might not be useful in all circumstances – obviously at some point the CPU overhead of gzip’ing might cause more problems than its worth, or something. But if you’re dealing with multi-megabyte chunks of text that MySQL only needs to pull in and out (ie, you don’t need to sort by the contents or do anything else with that data from within MySQL), it’s probably worth trying.
Comments: 0 --
Something I’ve been meaning to check out for a while – locking down WordPress to make it really secure. It’s always freaked me out a bit having web server-writable directories, but it just makes WordPress so powerful and, frankly, easy to use.
I checked out the hardening guide on the official WordPress site. It has a bunch of tips about how to set file system permissions, but at the end of the day you basically need to keep certain directories world-writable if you want to have that handy functionality that lets you do things like install plugins, edit themes, and automatically update.
However, after reading about a new zero-day exploit in a particular file that is packaged with many WordPress themes (not one that I happened to have installed), it drove me to action, along with the realisation that basically none of those simply hardening things is going to be useful if your site is set up with web-writable directories. If there’s an exploit in code – whether it’s core WP code or some random thing you’ve added in a plugin or theme – chances are you’ll be vulnerable.
So I have decided to try something else.
1) I’ve chowned all the files in my WordPress directory to be a non-web user, but left o+rx, which means the web process can happily read everything and serve my files – but it can no longer write to the directory. This of course means all that functionality I mentioned above no longer works.
2) I’ve created a new Apache vhost on my VPS on a separate port. As I am running ITK MTM – a module for Apache that allows me to specify what uid/gid the Apache process will run at on a per-user basis – I can tell this vhost to run as the same username as the non-web user that owns all the files.
3) I’ve made a tiny change to my wp-config.php file so that it lets me access this WordPress instance on the vhost without rewriting the URLs and forwarding me back to the main vhost. I just did something like this:
$t_port = 8958;
$t_servername = 'http://trog.qgl.org';
if ($_SERVER['SERVER_PORT'] == $t_port)
$t_servername .= ":$t_port";
define('WP_SITEURL', $t_servername);
define('WP_HOME', $t_servername);
4) Now, when I want to perform administrative tasks in WordPress, I just need to remember to access my /wp-admin directory via the http://trog.qgl.org:8958/ vhost.
5) Throw some extra security on this new vhost. I just whapped on a .htaccess in the vhost configuration, but you can do whatever you want – IP restrictions, or whatever.
After doing some basic testing to confirm it was all working as expected, I then went to write this post. I hit ‘save draft’ and was promptly greeted with a bizarre error from my WPSearch plugin (“Fatal error: Call to a member function find() on a non-object in [..]/wp-content/plugins/wpsearch/WPSearch/Drivers/Search/Phplucene.php”). This was mysterious! What had I done wrong?
So I looked through the code and WPSearch and trying to figure out what was going on. Eventually I realised – I’d tried writing this post from my non-privileged vhost. WPSearch must need to write to the disk somewhere as the web user – presumably to update the search index – and it was failing with that error because it wasn’t expecting suddenly to be able to no longer write to the disk (presumably when installing WPSearch it tells you if your file permissions are incorrect for usage).
After that I jumped back in to my privileged vhost and rewrote the post – and so far, so good. I’ll test this for a bit longer but to me it seems like an obvious way of running a more secure instance of WordPress, albeit with a bit more messing around.
Important notes:
Any plugin that you’re running that needs to interact by writing to the disk as part of its usual process will probably fail.
WP Super Cache is one that I’m using that will simply not work with this method – cache requests fail silently from the public interface and the cache simply will not function.
To fix this you need to find out what it needs to write to and give it full permission (which somewhat obviates the point of this exercise, but I’d much rather have only the cache directory world-writable) – in this case, ‘chmod o+w ./wp-content/cache’ fixes up WP Super Cache.
I’ll add more as I discover more.
Updated 2011-08-03: Added WP_HOME into step 3; it is required for various reasons – things like WP Super Cache and the permalinks menu break without it.
Updated 2011-08-15: A new problem – adding images into a post while you’re using the ‘admin port’ means that they’ll get referenced with this port. Not sure how to work around that one.
Comments: 0 --
Differences in Requesting gzip’ed Content using curl in PHP
Jul 29th, 2011There are some slight differences in the way curl requests are handled when you’re requested gzip’ed content from a web server. I found these slightly non-obvious, although it’s really pretty clear, but in the interests of trying to clarify I thought I’d write this down.
If you want to use curl to retrieve gzip’ed content from a webserver, you simply do something like this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate");
$data = curl_exec($ch);
What I found that was weird was that when I did something like ‘strlen($data)’ after that call, the result clearly indicated that the retrieved data was not compressed – strlen() was reporting 100 kbytes, but when I wget’ed the same page gzip’ed, I could see that it was only around 10 kbytes.
I added the header option to the curl request so I could see what was going on, so the code became:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate");
curl_setopt($ch, CURLOPT_HEADER, true);
$data = curl_exec($ch);
This yielded something like:
HTTP/1.1 200 OK
Date: Thu, 28 Jul 2011 23:03:42 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Mono
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 11091
Connection: close
Content-Type: text/html; charset=UTF-8
So the web server thought was clearly returning a compressed document, as it matched the ~10 kbyte figure I was seeing with wget, but the actual size of the $data variable was out of whack with this.
As it turns out, CURLOPT_ENCODING actually also controls whether the curl request decodes the response from the webserver. So in addition to setting the required header for the request, it also transparently decompresses it so you can deal directly with the uncompressed content. Upon reflection, this is a little obvious if you just read the manual page.
Basically, the problem was that I was expecting (and wanting) to get a binary chunk of compressed data. This was not the case, but what curl was doing worked out fine for me anyway.
However, I did figure out how to get the binary chunk that I was initially wanting. Basically instead of using the CURLOPT_ENCODING option, you just add a header to the request and set binary transfer mode on, so the code simply becomes:
$headers[] = "Accept-Encoding: gzip";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$data = curl_exec($ch);
This will return the gzip’ed chunk of binary gibberish to $data (which, of course, will be much smaller when you run strlen() on it).
Comments: 0 --
PHP Compression: gzcompress vs gzdeflate vs gzencode
Jul 25th, 2011Really, PHP? You have three different zlib functions for compressing? I’m sure there’s an excellent reason for this but I’ve barely looked at zlib in PHP ever so was a bit surprised at the variety and subtle differences between them.
I happened to pick gzcompress() initially and struggled a bit trying to figure out what was going on – it seems to produce a consistent two byte header of 78 5e, but that is different to what is mentioned in the magic number listing I found – gzip is listed as 1f 8b 08, which is what you’ll see if you use gzencode(). gzdeflate() doesn’t seem to leave a header at all.
This post on Stack Overflow has a little info about the differences; which one you’ll need depends exactly on what you’re doing.
To make things even more awesome though, just after I decided I want to use gzencode(), I discovered that gzdecode() isn’t actually implemented in PHP 5.3 – apparently it is scheduled for PHP 6, so presumably gzencode() is only useful to those who have another mechanism to extract gzip’ed data.
I did a very quick benchmark with about 30 files totaling around 130MB and got the following results using compression option 4, though I tested on 9 and there was little difference:
gzdeflate():
real 0m5.562s
user 0m5.436s
sys 0m0.125s
gzencode():
real 0m5.679s
user 0m5.566s
sys 0m0.111s
gzcompress():
real 0m6.011s
user 0m5.878s
sys 0m0.131s
Comments: 0 --
Recovering Tracks from Broken App ‘My Tracks’ on Android
Jul 8th, 2011I was in Europe recently and recorded a bunch of walking tracks on my Android phone using the very handy application My Tracks from Google. Unfortunately there’s a known problem at the moment where the application won’t start if the phone crashes or loses power while recording – the current theory is that the SQLite database gets corrupted preventing it from loading properly. If you load My Tracks and it just sits there with a black screen and you eventually get the ‘force close’/'wait’/'report’ dialog, you’re probably experiencing this problem.
It is being actively discussed on the My Tracks development mailing list at the moment as several people have been bitten by this and are wondering how to get save their recorded tracks. You can fix the problem simply by deleting all your data, which might work if you don’t care about losing your tracks, but if you’re like me you probably want to hang on to it.
Fortunately a clever cookie by the name of Terry just figured out a workaround to this problem – it involves installing a new application called Locus Free which can (somehow) access your recorded My Tracks tracks, and then export them to KML or GPX.
Based on his instructions I was able to successfully recover all of my recorded tracks, which I’m pretty happy about. Here are the steps I used:
- Install Locus Free from Marketplace
- Go into Phone Settings->Applications and move Locus Free to SD card (I did this so it was easier to access the exported files; I wasn’t sure how else to get them easily)
- Load Locus Free
- Close the first dialog that pops up
- Click the second icon from the top right (cylinder with a floppy disk) to open Data Manager
- Click ‘Export data’ on the Data Manager screen
- Select ‘tracks’ from the small popup
- Change export type to your preferred option (KML for Google Maps/Earth, for example)
- Hit Export
- Access your SD card via your normal mechanism (I use ES Explorer) and browse to /Locus/export and you’ll find the files.
Update 2011-08-05: Gunnar from the MyTracks mailing list also points out the following way to retrieve your data:
However I found that an even quicker way to recover my recorded tracks, including the one that I was recording when the phone died,
was simply to install MyTracks v 1.1.4 over the current v 1.1.7. No uninstall, simply download 1.1.4 and save it to SD-card, then install it from there by double-clicking.I haven’t found any limitations with 1.1.4 compared to 1.1.7 so I happily use 1.1.4 for recording my tracks, including the heart-rate from my Zephyr HxM.
The v1.1.4 version he mentions is available for download here.
Comments: 2 --






