HTTP responses: curl and wget return different results

4

1

To check the HTTP response headers for a set of URLs, I send the following request headers with curl:

foreach ( $urls as $url )
{
    // Setup headers - I used the same headers from Firefox version 2.0.0.6
    $header = array(); // start with a fresh header list for each URL
    $header[] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[] = "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)' );
    curl_setopt( $ch, CURLOPT_HTTPHEADER, $header );
    curl_setopt( $ch, CURLOPT_REFERER, 'http://www.google.com' );
    curl_setopt( $ch, CURLOPT_HEADER, true );          // include response headers in the output
    curl_setopt( $ch, CURLOPT_NOBODY, true );          // request headers only, no body
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY );
    curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );           // timeout after 10 seconds

    $response = curl_exec( $ch );
    // first line of the response is the status line, e.g. "HTTP/1.1 406 Not Acceptable"
    echo $url . ' => ' . strtok( $response, "\r\n" ) . "\n";
    curl_close( $ch );
}

Sometimes I receive 200 OK, which is good, and other times 301, 302 or 307, which I consider good as well. But sometimes I receive odd statuses such as 406, 500 or 504, which would suggest an invalid URL, yet when I open those URLs in a browser they work fine.

For example, the script returns

http://www.awe.co.uk/ => HTTP/1.1 406 Not Acceptable

and wget returns

wget http://www.awe.co.uk/
--2011-06-23 15:26:26--  http://www.awe.co.uk/
Resolving www.awe.co.uk... 77.73.123.140
Connecting to www.awe.co.uk|77.73.123.140|:80... connected.
HTTP request sent, awaiting response... 200 OK

Does anyone know which request header I am missing, or which one I am sending that I shouldn't?

Fab

Posted 2011-06-23T14:32:37.177

Reputation: 41

Answers

5

You are including invalid HTTP headers in your request:

$header[ ] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[ ] = "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";

On the first line, the list ends with a comma, that is, an empty content type, which is the cause of the 406 Not Acceptable errors. The second line is not even an HTTP header on its own.

If you were looking at Firefox HTTP conversations with a packet sniffer, you probably saw something like this:

Accept: text/xml,application/xml,application/xhtml+xml,
    text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

Since the second line starts with whitespace, they are treated as a single header by the server. They must also be passed as one header to curl:

$header[] = "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";

You can use http://echo.opera.com to compare the requests being sent.
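Another way to compare (a rough sketch, assuming the PHP cURL extension in version 5.1.3 or later) is to ask curl itself for the request headers it actually sent:

$ch = curl_init( 'http://www.awe.co.uk/' );
curl_setopt( $ch, CURLOPT_NOBODY, true );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLINFO_HEADER_OUT, true );   // record the outgoing request headers
curl_setopt( $ch, CURLOPT_HTTPHEADER, array(
    "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
) );
curl_exec( $ch );
echo curl_getinfo( $ch, CURLINFO_HEADER_OUT );   // the raw request line and headers that were sent
curl_close( $ch );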

user1686

Posted 2011-06-23T14:32:37.177

Reputation: 283 655

1

You're not providing a Host: header in your $header[] array. In HTTP/1.1 requests to content servers, the Host: header is mandatory. The non-4xx responses are from cases where you happened to hit a content HTTP server that is forgiving about this protocol error.
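A rough sketch of adding it explicitly, derived from each URL (whether curl supplies one on its own is discussed in the comments below):

$url      = 'http://www.awe.co.uk/';
$host     = parse_url( $url, PHP_URL_HOST );   // "www.awe.co.uk"

$header   = array();
$header[] = 'Host: ' . $host;
// ... the remaining Accept/Cache-Control/etc. headers as in the question ...

$ch = curl_init( $url );
curl_setopt( $ch, CURLOPT_HTTPHEADER, $header );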

JdeBP

Posted 2011-06-23T14:32:37.177

Reputation: 23 855

As the header is mandatory, curl includes it automatically (and if it didn't, the response would be 400 Bad request). – user1686 – 2011-06-23T17:33:53.203

Not true in practice. It only sometimes invents one if one hasn't been provided, depending on what is passed to other options and to curl_init(), which we haven't been told. And, as should be obvious from the data in the question even if one has never encountered it in practice, not everyone gets the error responses for incorrect protocol right. – JdeBP – 2011-06-24T09:20:04.607

0

In my humble opinion your script looks OK, and since you sometimes get correct results, it should be working.

Are you the owner of http://www.awe.co.uk/?
Maybe there is a script running that decides what to do depending on some environment variables. For example, your script accesses the site with the user agent "googlebot", while wget's user agent will be "wget". A script on the web server may check whether the visitor is Google and deliver completely different content from what your browser sees, and in the same way it may send different return codes.
To test this issue, you might want to reduce your script, or extend the wget command to send the same request and compare the results.
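For example (a rough sketch; the header values would need to match your script exactly), wget can be told to send the same user agent, referer and Accept header, and to print the server's response headers:

wget --spider --server-response \
     --user-agent='Googlebot/2.1 (+http://www.google.com/bot.html)' \
     --referer='http://www.google.com' \
     --header='Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5' \
     http://www.awe.co.uk/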

Another thing I can imagine: how often did you run your script? Maybe the web server noticed the heavy traffic coming from your script and sends 406 (or something else) if you overdo it ;-)

binfalse

Posted 2011-06-23T14:32:37.177

Reputation: 1 426