3

Is there an official API for iplists.com from which I can get the list of spiders?

My intention is to whitelist these IPs for site scraping.

Bill the Lizard
Quintin Par
  • Does `robots.txt` not work as expected for these bots? I would guess the bots' IPs could change without notice. – jscott Mar 14 '12 at 03:05
  • Ask them if there's an official anything. I don't see mention of one on the site... – Bill Weiss Mar 14 '12 at 03:06
  • These lists were last updated four years ago. I assume it's safe to consider this resource dead and irrelevant. – Sven Mar 14 '12 at 03:07
  • @jscott I think the problem he's trying to solve is "I want to allow bots to site scrape / crawl as fast as they want, but I want to stop *other people* from snarfing my content (bloody leeches!)" -- If that's the case `robots.txt` won't help because the leeches will just ignore it :-/ – voretaq7 Mar 14 '12 at 03:08

3 Answers

8

Not that I know of, and it could change at any time at the discretion of the bot operators.

Google offers some specific guidance and explanation on this:

The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple years ago and it was a real hassle alerting webmasters who had hard-coded an IP range.

and they suggest using a DNS check (forward and reverse) to verify:

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

```
$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
```

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

This is probably the best general advice, but it is somewhat resource intensive (a pair of DNS lookups for every check).
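For illustration, here is a minimal sketch of that forward-confirmed reverse DNS check in Python. The function name is hypothetical and the use of the standard `socket` module is my own choice; the answer prescribes only the technique itself:

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check, as described above."""
    try:
        # Reverse lookup: IP -> hostname (PTR record)
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False  # no PTR record at all

    # The PTR name must be inside googlebot.com. The leading dot matters:
    # it rejects spoofed names like "fakegooglebot.com".
    if not hostname.endswith(".googlebot.com"):
        return False

    try:
        # Forward lookup: hostname -> IPs; the original IP must be among them,
        # which defeats the spoofed-PTR attack mentioned in the quote above
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False

    return ip in addresses

print(is_verified_googlebot("66.249.66.1"))  # True for the example above
```

Since each call costs a pair of DNS lookups, you would normally cache the verdict per IP (e.g. with `functools.lru_cache`) rather than re-verifying on every request.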

voretaq7
  • As an aside, this works great in theory, but it can fail spectacularly in practice. For instance, it is not possible to do this _reliably_ in PHP due to a [design flaw](https://bugs.php.net/bug.php?id=53092) in the language. – Michael Hampton Oct 15 '12 at 03:57
2

There's no list of IP addresses for "good" search engine bots that I know of, and if there were it would be horribly out of date pretty quickly, as you've already discovered.

One thing you can do is to create a bot trap. This is simple in theory: you create a page that is linked from your web site but hidden from normal users (e.g. via CSS tricks) and then Disallow it in robots.txt. You then wait a week, since legitimate search engines may cache robots.txt for that long, and then start banning anything that hits the trap page (e.g. with fail2ban).
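As a rough sketch of the banning step, the following Python script scans an access log for clients that requested the trap page. The trap path `/bot-trap/` (which would also appear in the hidden link and in a `Disallow: /bot-trap/` line in robots.txt), the log location, and the assumption of the common Apache/nginx "combined" log format are all my own for illustration:

```python
import re

# Hypothetical trap path -- must match the hidden link and the
# robots.txt Disallow line on your site
TRAP_PATH = "/bot-trap/"

# Combined log format: IP ident user [date] "METHOD path proto" status ...
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[A-Z]+ (\S+)')

def ips_to_ban(logfile: str) -> set:
    """Collect client IPs that requested the trap page."""
    offenders = set()
    with open(logfile) as f:
        for line in f:
            m = LOG_RE.match(line)
            if m and m.group(2).startswith(TRAP_PATH):
                offenders.add(m.group(1))
    return offenders

if __name__ == "__main__":
    for ip in sorted(ips_to_ban("/var/log/nginx/access.log")):
        print(ip)  # feed these to your firewall or a fail2ban action
```

In practice you would run something like this from cron, or express the same match as a fail2ban filter so the bans happen automatically.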

Michael Hampton