IP address of spiders and “official” web bots

Question

Is there an official API for iplists.com from which I can get the list of spider IP addresses?

My intention is to whitelist these IPs for site scraping.

Does `robots.txt` not work as expected for these bots? I would guess the bots' IPs could change without notice. — jscott, Mar 14 '12 at 03:05
Ask them if there's an official anything. I don't see mention of one on the site... — Bill Weiss, Mar 14 '12 at 03:06
These lists were last updated four years ago. I assume it's safe to consider this resource dead and irrelevant. — Sven, Mar 14 '12 at 03:07
@jscott I think the problem he's trying to solve is "I want to allow bots to site scrape / crawl as fast as they want, but I want to stop *other people* from snarfing my content (bloody leeches!)" -- If that's the case `robots.txt` won't help because the leeches will just ignore it :-/ — voretaq7, Mar 14 '12 at 03:08

score 8 · Answer 1 · answered Mar 14 '12 at 03:06

Not that I know of, and any such list could change at any time at the discretion of the bot operators.

Google offers some specific guidance and explanation on this:

The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple years ago and it was a real hassle alerting webmasters who had hard-coded an IP range.

and they suggest using a DNS check (forward and reverse) to verify:

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

This is probably the best general advice, but it is somewhat resource intensive: every client you verify costs a reverse and a forward DNS lookup.
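For illustration, the reverse-then-forward check quoted above could be sketched in Python roughly like this. The function names (`is_verified_crawler`, `_reverse`, `_forward`) are made up for this example, and the only accepted suffix is the `googlebot.com` domain mentioned in the quote. The resolvers are passed in as parameters, which is purely a design choice of this sketch so the logic can be exercised without network access.

```python
import socket


def _reverse(ip):
    # PTR lookup: primary host name registered for the address.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    # Forward lookup: every address the host name resolves to.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(ip, suffixes=(".googlebot.com",),
                        reverse=_reverse, forward=_forward):
    """Return True only if reverse AND forward DNS agree for this IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror / socket.gaierror are OSError subclasses
        return False
    # Step 1: the PTR name must sit inside an expected crawler domain.
    if not hostname.endswith(tuple(suffixes)):
        return False
    # Step 2: the name must resolve back to the same IP, otherwise
    # the PTR record may simply have been spoofed.
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

Calling `is_verified_crawler("66.249.66.1")` with the default resolvers performs real DNS lookups, so caching the result per IP is probably worthwhile. The plain string comparison is adequate for IPv4; IPv6 addresses can be written in several ways and would need normalizing first.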

As an aside, this works great in theory, but it can fail spectacularly in practice. For instance, it is not possible to do this _reliably_ in PHP due to a [design flaw](https://bugs.php.net/bug.php?id=53092) in the language. — Michael Hampton, Oct 15 '12 at 03:57

score 2 · Answer 2 · edited May 23 '17 at 12:41

There's no list of IP addresses for "good" search engine bots that I know of, and if there were, it would be horribly out of date pretty quickly, as you've already discovered.

One thing you can do is to create a bot trap. This is simple in theory: create a page that is linked from your web site but hidden from normal users (e.g. via CSS tricks), then Disallow it in robots.txt. Wait a week, since legitimate search engines may cache robots.txt for that long, then start banning anything that hits the trap page (e.g. with fail2ban).
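As a rough sketch of the detection half of such a trap, the snippet below scans access-log lines and collects the client IPs that requested the trap page. It is not fail2ban, just a stand-in to show the idea. The trap path `/trap/`, the function name `trap_hits`, and the assumption that the log is in Common Log Format are all hypothetical choices for this example.

```python
import re

# Hypothetical trap URL: linked from the site, hidden from users,
# and listed under Disallow in robots.txt.
TRAP_PATH = "/trap/"

# Matches the start of a Common Log Format line:
# ip ident user [timestamp] "METHOD path ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)'
)


def trap_hits(lines, trap_path=TRAP_PATH):
    """Return the set of client IPs that requested the trap page."""
    offenders = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        # Ignore any query string when comparing against the trap path.
        path = match.group("path").split("?", 1)[0]
        if path == trap_path:
            offenders.add(match.group("ip"))
    return offenders
```

The resulting set could then be fed to whatever does the actual banning (firewall rules, a deny list in the web server, and so on).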

score 1 · Answer 3 · answered Aug 30 '22 at 07:56

Googlebot: https://developers.google.com/search/apis/ipranges/googlebot.json

Bingbot: https://www.bing.com/toolbox/bingbot.json

Facebook crawler: https://developers.facebook.com/docs/sharing/webmasters/crawler/
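A minimal sketch of how such a published range file might be used: parse the JSON into networks, then test a client address against them. It assumes the file has a top-level `prefixes` array whose entries carry an `ipv4Prefix` or `ipv6Prefix` CIDR string, so check the actual files before relying on that. The function names and the sample data (documentation-only ranges, not real crawler ranges) are made up for this example.

```python
import ipaddress
import json


def load_networks(json_text):
    """Parse a range file into a list of ip_network objects."""
    data = json.loads(json_text)
    networks = []
    for entry in data.get("prefixes", []):
        # Assumed layout: each entry has either an IPv4 or an IPv6 prefix.
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_networks(ip, networks):
    """Return True if the address falls inside any of the networks."""
    addr = ipaddress.ip_address(ip)
    return any(net.version == addr.version and addr in net
               for net in networks)


# Sample data only: reserved documentation ranges, not real crawler IPs.
SAMPLE = '{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}'
networks = load_networks(SAMPLE)
```

Since the ranges can change, the file would need to be re-fetched periodically rather than copied into a config once.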

Chu Khanh Van
