Differences in Requesting gzip’ed Content using curl in PHP

There are some slight differences in the way curl requests are handled when you’re requested gzip’ed content from a web server. I found these slightly non-obvious, although it’s really pretty clear, but in the interests of trying to clarify I thought I’d write this down.

If you want to use curl to retrieve gzip’ed content from a webserver, you simply do something like this:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate");
$data = curl_exec($ch);

What I found that was weird was that when I did something like ‘strlen($data)’ after that call, the result clearly indicated that the retrieved data was not compressed – strlen() was reporting 100 kbytes, but when I wget’ed the same page gzip’ed, I could see that it was only around 10 kbytes.

I added the header option to the curl request so I could see what was going on, so the code became:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate");
curl_setopt($ch, CURLOPT_HEADER, true);
$data = curl_exec($ch);

This yielded something like:

HTTP/1.1 200 OK
Date: Thu, 28 Jul 2011 23:03:42 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: Mono
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 11091
Connection: close
Content-Type: text/html; charset=UTF-8

So the web server thought was clearly returning a compressed document, as it matched the ~10 kbyte figure I was seeing with wget, but the actual size of the $data variable was out of whack with this.

As it turns out, CURLOPT_ENCODING actually also controls whether the curl request decodes the response from the webserver. So in addition to setting the required header for the request, it also transparently decompresses it so you can deal directly with the uncompressed content. Upon reflection, this is a little obvious if you just read the manual page.

Basically, the problem was that I was expecting (and wanting) to get a binary chunk of compressed data. This was not the case, but what curl was doing worked out fine for me anyway.

However, I did figure out how to get the binary chunk that I was initially wanting. Basically instead of using the CURLOPT_ENCODING option, you just add a header to the request and set binary transfer mode on, so the code simply becomes:

$headers[] = "Accept-Encoding: gzip";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$data = curl_exec($ch);

This will return the gzip’ed chunk of binary gibberish to $data (which, of course, will be much smaller when you run strlen() on it).

2 thoughts on “Differences in Requesting gzip’ed Content using curl in PHP”

  1. HI,
    I have a small doubt here . how to handle when only some of the pages have gzipped content and other pages does not have the gzipped content.

  2. You can peek at the content to see if it is gzip’ed before decompressing it – the first few bytes of it will have some gzip header (I can’t recall what it is but it should be findable with a quick search).

Leave a Reply

Your email address will not be published. Required fields are marked *