My Google-fu is failing me right now.

I'm trying to figure out whether Google's web crawler downloads non-image binary files when it spiders sites. I know it downloads (and indexes) images and PDFs, but what about .zip, .dmg, etc.?

My client offers a lot of software packages for download on their site, and they're trying to figure out whether search engines account for much of the bandwidth those files consume.

jessica
    Why not just block the files directory in robots.txt and be sure? Even if they don't currently, nothing's stopping them from adding such a feature in the future. – ceejayoz Apr 27 '12 at 21:02
  • Definitely a good idea for the future, but the issue I'm dealing with now is that my client has sent me a list of hits on their downloads, and they want to know whether it's people or web crawlers. I'm trying to figure out how to answer this question regarding their existing/past stats. – jessica Apr 27 '12 at 21:05
  • @ceejayoz Similarly, nothing's stopping them from deciding to ignore binary files in `robots.txt` some day as it's not an access control mechanism, it's just a suggestion that Google voluntarily opts in to. In a similar vein, Google respects `robots.txt` but other search engines do not necessarily. – msanford Apr 27 '12 at 21:30

3 Answers

The answer to your first question seems to be "maybe":

What file types can Google index?

Google can index the content of most types of pages and files. See the most common file types.

But the list of common file types it links to covers only text-based formats.

Even if you search for binary files like Windows installers (.msi), you may get a link to a page containing the file or a direct link to the file itself, but Google almost certainly decides how to index it based on the text surrounding the link rather than by downloading and deciphering the binary's contents.

As to your main question, Google's recommended way of checking whether its bot hit your site is a reverse DNS lookup:

$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
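
As vincent.io points out in the comments, the PTR record alone isn't proof, since anyone who controls the reverse zone for their own IP space can point it at a .googlebot.com name; the full check also forward-resolves the returned host name and confirms it maps back to the original IP. A rough shell sketch of that two-step verification (IPv4 only, and the script/file name is whatever you choose):

#!/bin/sh
# Sketch: check whether the IP given as $1 really belongs to Googlebot.
# Step 1: reverse-resolve the IP to a host name.
# Step 2: accept only *.googlebot.com (or *.google.com) names.
# Step 3: forward-resolve that name and confirm it maps back to the same IP.
IP="$1"
NAME=$(host "$IP" | awk '/pointer/ {print $NF}' | sed 's/\.$//')

case "$NAME" in
  *.googlebot.com|*.google.com)
    if host "$NAME" | grep -q "has address ${IP}$"; then
      echo "$IP verified as $NAME"
    else
      echo "$IP: forward lookup of $NAME does not match"
    fi
    ;;
  *)
    echo "$IP does not reverse-resolve to a Google host name ($NAME)"
    ;;
esac

Save it to a file and run it with the IP as its argument, e.g. with 66.249.66.1 from the example above. Google's "Verifying Googlebot" help page describes the same reverse-then-forward procedure; the script is just my own quick take on it.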

Keep in mind that Google's mission "is to organize the world’s information and make it universally accessible and useful." This means that they are constantly innovating, attempting to index non-text data in ways that make it searchable. To expand on ceejayoz's point that just because they didn't do it yesterday doesn't mean they won't do it tomorrow: Google will do everything they can to be able to do it tomorrow!

msanford
    I wonder if Google may do a HEAD request for binary files, as well. – ceejayoz Apr 27 '12 at 21:31
    Good point, but remember to also forward resolve the host name. Anyone who 'owns' an IP address can set a PTR record to a .googlebot.com domain. – vincent.io Apr 27 '12 at 21:31
  • @vvanscherpenseel Forgot that one! And here I go just doing whatever Google tells me ;) – msanford Apr 27 '12 at 21:33
  • Yes, Google does download many binary files. I put a large collection of binary data on the web (squashfs filesystem images, of little or no use to Google). I did not ask Google to download it all and potentially double my bandwidth costs. Why don't they just read a little (an HTTP HEAD request) and then give up when they see it's an unknown type of content? – Sam Watkins Oct 16 '14 at 06:07
  • This answer is 2 years old; good (though unfortunate) to know this is happening now... – msanford Oct 16 '14 at 14:24

Instead of taking a guess, why not check the access logs to see what the User-Agent or the requesting host is? That way you can even tell how much bandwidth Google (or other crawlers) is consuming, by summing the bytes transferred per request.
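
For example, assuming the standard Apache/Nginx "combined" log format (where field 10 is the response size in bytes and the User-Agent is the last quoted field), something along these lines totals the bytes served to requests that identify themselves as Googlebot; the log path is just a placeholder:

# Rough sketch: sum the bytes sent to self-identified Googlebot requests.
# grep matches the whole line, which is good enough here; a stricter
# version would test only the User-Agent field.
grep -i 'Googlebot' /var/log/apache2/access.log \
  | awk '{ bytes += $10 } END { printf "Googlebot traffic: %.1f MB\n", bytes / 1048576 }'

(In that format the size field is "-" for zero-byte responses; awk quietly treats it as 0, which is fine for a rough total.)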

vincent.io
    As I mention in my answer, `user-agents` are trivial to spoof. Google's recommended method is to perform a reverse DNS lookup. – msanford Apr 27 '12 at 21:28

I recently noticed an unusual spike in my web server's traffic. Looking at the web stats showed that the small set of large binary files on my site had been downloaded in rapid succession by a group of seemingly-related IP addresses. I used urlquery.net to find out who owns those IPs and found them all to be Google's.
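
(For what it's worth, the same ownership check can be done from the command line with a plain whois lookup; the address below is just the example used elsewhere on this page, and the field names vary a little between registries:)

# Look up the organisation that owns the address block.
whois 66.249.66.1 | grep -iE 'orgname|org-name|netname'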

I came here looking for answers, but in reading what others have said, I realized that Google may be scanning binaries for malware, or at least submitting them to malware detection services for scanning. We know that Google detects and flags malware on web sites, so it's reasonable to assume that doing this involves downloading the files in question.

Google's 'If your site is infected' page says this: 'Use the Fetch as Google tool in Webmaster Tools to detect malware'.

Note also that the files in question do not appear in Google's search results, presumably because I use robots.txt to disallow indexing those files. Assuming I'm right, when Google finds a binary file that is linked from a public web page, it will scan the file for malware, regardless of robots.txt, but will only index the file if it's allowed by robots.txt. I think this is exactly what they should be doing, as long as the scanning is infrequent.
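
For reference, the relevant part of a robots.txt along the lines of mine is just a blanket disallow on the download directory (the /downloads/ path here is an example, not my actual layout):

User-agent: *
Disallow: /downloads/

As noted in the comments on the question, this is a request that well-behaved crawlers honour rather than an access control, and, if my reading above is right, it doesn't stop the files from being downloaded for malware scanning either.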

Update: Google seems to be doing this every ten days or so. This is going to affect my bandwidth limits.

boot13