I've set up a new Amazon EC2 instance. Within a day or two I started getting strange GET requests from Googlebot-like IPs (e.g. 66.249.76.84, 66.249.74.152), about one every 10 seconds. Some examples:

66.249.74.152 - - [10/Apr/2013:06:05:02 +0000] "GET /play/gp4GbjXBD4B3?sh=04f2fd19ae2dd623e7135d29a1894f03&sh=f172a32c89190e28f9c27123d7c6cf43&sh=04f2fd19ae2dd623e7135d29a1894f03 HTTP/1.1" 404 295 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"    
66.249.76.84 - - [11/Apr/2013:03:51:44 +0000] "GET /api/levels/2ry7ZAh0Y91r HTTP/1.1" 404 295 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

They are requesting hashes under paths like:

/play/'some_hash_here'
/profile/'some_hash_here'
/level/'some_hash_here'
/api/'some_hash_here'

I have never had such folders on this site. To do something about it, I tried to block them in robots.txt:

User-agent: *
Disallow: 
Crawl-delay: 120
Disallow: /play
Disallow: /profile
Disallow: /level

But it didn't help at all; the bot apparently doesn't read robots.txt. To get rid of the mess these requests left in my error_log file, I created rules in the .htaccess file like this:

Redirect 301 /play 'some_other_site'
Redirect 301 /level 'some_other_site'
Redirect 301 /profile 'some_other_site'
Redirect 301 /api 'some_other_site'

Moreover, I found some traces of the real Googlebot crawling my site, and its behavior was perfectly normal: it requested only pages that were linked from other pages of my site. How can I get rid of this fraudulent scanning?

domage

2 Answers

OK. I don't know what it was or what it wanted, but I think I have found a solution based on the fail2ban package.
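
To illustrate the general idea behind a log-based approach like this (match offending lines in the access log, then count them per client IP), here is a minimal Python sketch. It is not a fail2ban configuration and not the setup actually used here; the regular expression assumes the Apache combined log format shown in the question, and the names `LOG_RE`, `WATCHED` and `count_suspicious_404s` are hypothetical.

```python
import re
from collections import Counter

# Assumption: Apache combined log format, as in the sample lines in the question.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

# Path prefixes that never existed on this site (taken from the question).
WATCHED = ("/play/", "/profile/", "/level/", "/api/")


def count_suspicious_404s(lines):
    """Count 404 responses per client IP for GET requests under WATCHED prefixes."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # not a GET line in the expected format
        if m.group("status") == "404" and m.group("path").startswith(WATCHED):
            hits[m.group("ip")] += 1
    return hits
```

A tool such as fail2ban applies the same kind of pattern matching to the log and bans an address once its hit count crosses a threshold within a time window. Note that banning these addresses would also block the real Googlebot if the requests turn out to be genuine.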

domage

Those IPs belong to Google, so chances are they're legitimate Googlebot hits.

I wouldn't worry about them. They're unlikely to be hacking attempts. Rather, the most likely situation is that your server's IP was previously that of another website that had these URLs. This is fairly common on Amazon EC2 because of the floating nature of their IP addresses.
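
If you want to confirm that a given address really is Googlebot rather than someone spoofing the user agent, the usual check is a reverse DNS lookup followed by a forward lookup of the resulting hostname. The sketch below is one way to do that in Python, under a few assumptions: it handles IPv4 only, it treats hostnames ending in `googlebot.com` or `google.com` as Google's, and the function name and the injectable `reverse`/`forward` parameters are hypothetical conveniences for testing.

```python
import socket

# Assumption: genuine Googlebot hosts resolve to names under these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=None, forward=None):
    """Return True if ip passes a reverse-then-forward DNS check (IPv4 only)."""
    # Default resolvers use the standard library; they can be swapped out in tests.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same address.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        return False
```

Calling `is_verified_googlebot("66.249.76.84")` performs real DNS queries, so the result depends on the network and on the current DNS records.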

ceejayoz
  • You are correct. I've just found the server's IP in the Google results, and everything became clear. I hope it will read robots.txt some day. – domage Apr 12 '13 at 04:26