I've set up a new Amazon EC2 instance. Within a day or two I started getting strange GET requests from Googlebot-like IPs (e.g. 66.249.76.84, 66.249.74.152), roughly one every 10 seconds. Some examples:
66.249.74.152 - - [10/Apr/2013:06:05:02 +0000] "GET /play/gp4GbjXBD4B3?sh=04f2fd19ae2dd623e7135d29a1894f03&sh=f172a32c89190e28f9c27123d7c6cf43&sh=04f2fd19ae2dd623e7135d29a1894f03 HTTP/1.1" 404 295 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.76.84 - - [11/Apr/2013:03:51:44 +0000] "GET /api/levels/2ry7ZAh0Y91r HTTP/1.1" 404 295 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
They are probing hash-like paths such as
/play/'some_hash_here'
/profile/'some_hash_here'
/level/'some_hash_here'
/api/'some_hash_here'
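Before treating these as fakes, it's worth checking whether the IPs really belong to Google. Google's documented verification is a two-step DNS check: the reverse (PTR) lookup of the IP must end in googlebot.com or google.com, and the forward lookup of that hostname must map back to the same IP. A minimal sketch in Python (the function names `is_googlebot_host` and `verify_googlebot` are my own, not from any library):

```python
import socket

def is_googlebot_host(hostname):
    """Check that a reverse-DNS name falls under Google's crawler domains.

    Per Google's verification procedure, a genuine Googlebot IP has a
    PTR record ending in googlebot.com or google.com.
    """
    hostname = hostname.rstrip(".")  # tolerate a trailing root dot
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def verify_googlebot(ip):
    """Full two-step check (requires working DNS): reverse-resolve the IP,
    validate the domain, then confirm the forward lookup of that hostname
    includes the original IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:  # no PTR record at all -> not Googlebot
        return False
    if not is_googlebot_host(hostname):
        return False
    return ip in socket.gethostbyname_ex(hostname)[2]
```

If `verify_googlebot("66.249.74.152")` comes back true, the requests are from Google itself (perhaps crawling URLs it discovered elsewhere), not a spoofed scanner, and blocking by user agent or IP would be counterproductive.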
These paths have never existed on this site. As a first attempt, I tried to block them in robots.txt:
User-agent: *
Disallow:
Crawl-delay: 120
Disallow: /play
Disallow: /profile
Disallow: /level
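As an aside, the file above carries a redundant empty `Disallow:` line and omits `/api`, one of the probed prefixes. A tidier equivalent covering all four (the addition of `/api` is my assumption, based on the logged requests) would be:

```
User-agent: *
Crawl-delay: 120
Disallow: /play
Disallow: /profile
Disallow: /level
Disallow: /api
```

Keep in mind that robots.txt is purely advisory: it only restrains well-behaved crawlers, so a spoofed bot will ignore it entirely.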
But it didn't help at all; these bots simply don't read robots.txt. To keep their requests out of my error_log file, I created rules in my .htaccess file like this:
Redirect 301 /play 'some_other_site'
Redirect 301 /level 'some_other_site'
Redirect 301 /profile 'some_other_site'
Redirect 301 /api 'some_other_site'
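One alternative to consider, sketched here as a hedge rather than a definitive fix: rather than 301-redirecting the probes (which sends the junk traffic to a third-party site and tells compliant crawlers the content moved), you could answer with "410 Gone" via mod_rewrite, so well-behaved crawlers drop the URLs and your logs stay quiet. Assuming mod_rewrite is enabled and this lives in the site root's .htaccess:

```apache
RewriteEngine On
# Answer probes for the four phantom prefixes with 410 Gone
RewriteRule ^(play|profile|level|api)(/|$) - [G,L]
```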
Moreover, I found traces of the real Googlebot crawling my site, and its behavior was perfectly normal: it requested only pages that were actually linked from pages on my site. How can I get rid of this fraudulent scanning?