Is there an official API for iplists.com that I can use to get the list of spiders?
My intention is to whitelist these IPs so they can crawl my site.
Not that I know of, and any such list could change at any time at the discretion of the bot operators.
Google offers some specific guidance and explanation on this:
The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple years ago and it was a real hassle alerting webmasters who had hard-coded an IP range.
and they suggest using a DNS check (forward and reverse) to verify:
Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:
$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.
This is probably the best general advice, but it is somewhat resource-intensive, since each check costs two DNS lookups (one reverse, one forward).
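For anyone who wants to do this check from code rather than with host, here is a minimal Python sketch of the reverse-then-forward lookup described in the quote. The function name is_verified_crawler, the allowed_suffixes parameter and the injectable reverse_lookup/forward_lookup arguments are my own, not anything Google publishes, and the sketch only handles IPv4 because socket.gethostbyname_ex does not return IPv6 addresses.

```python
import socket


def is_verified_crawler(ip, allowed_suffixes=(".googlebot.com",),
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a single IPv4 address.

    allowed_suffixes, reverse_lookup and forward_lookup are illustrative
    parameters (assumptions of this sketch); the lookups default to the
    standard library resolvers and can be swapped out for testing.
    """
    # Step 1: reverse lookup, IP -> hostname.
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    # Step 2: the name must be in an expected crawler domain.
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False

    # Step 3: forward lookup, hostname -> IPs, which must include the original IP.
    # This is what defeats a spoofer who only controls their own reverse DNS.
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

Calling is_verified_crawler("66.249.66.1") performs the same two lookups as the host commands above. Since the lookups are slow, it is probably worth caching the result per IP for a while instead of checking on every request.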
There's no list of IP addresses for "good" search engine bots that I know of, and if there were it would be horribly out of date pretty quickly, as you've already discovered.
One thing you can do is to create a bot trap. This is simple in theory: You create a page that is linked to in your web site but hidden from normal users (e.g. via CSS tricks) and then Disallow it in robots.txt. You then wait a week since legitimate search engines may cache robots.txt for that long, then start banning anything that hits the trap page (e.g. with fail2ban).
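If you would rather prototype the detection step before wiring up fail2ban, a rough Python sketch follows. It assumes a hypothetical trap URL of /bot-trap/ (which you would list in robots.txt as Disallow: /bot-trap/) and access logs in the common/combined log format; the names TRAP_PATH and trap_hits are made up for illustration. It only collects offending IPs; the actual banning is left to whatever tool you use.

```python
import re

# Hypothetical trap URL; it must match the path disallowed in robots.txt.
TRAP_PATH = "/bot-trap/"

# Assumes common/combined log format: IP, two fields, [timestamp], "METHOD path ...".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (?P<path>\S+)'
)


def trap_hits(log_lines, trap_path=TRAP_PATH):
    """Return the set of client IPs that requested the trap page."""
    offenders = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(trap_path):
            offenders.add(match.group("ip"))
    return offenders
```

You could feed it an open log file, e.g. trap_hits(open("access.log")), where access.log stands in for your server's real log path.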