
A webcrawler has brought our site down twice. It ignores our robots.txt, and we have had no reply from their customer services or support using both e-mail and Twitter.

I have had to create a URL redirect based on their user agent string, sending all their requests back to their own public website. Is this the right thing to do?
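For illustration, a Tuckey rule along these lines does what I've described (the user agent pattern and target URL below are placeholders, not the real ones):

```xml
<!-- Simplified sketch of the redirect rule; "BadBot" and the URL are placeholders -->
<rule>
    <name>Send the bad bot back to its own site</name>
    <!-- a condition with a name matches that request header against the regex -->
    <condition name="user-agent">.*BadBot.*</condition>
    <from>^/.*$</from>
    <!-- type="redirect" answers with an HTTP 302 pointing at their own site -->
    <to type="redirect">http://www.badbot-example.com/</to>
</rule>
```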

Edit: How do I return a 4xx error code based on the user agent string using Tomcat/Tuckey? (Our site is hosted on a Windows server, if that matters.) I can't use IP addresses, as the bot uses many (it's grid-based, apparently).

This is partly due to our website being an old and creaky legacy system, but Google's crawler and Bing's crawler do not knock us over, and our normal business traffic is just fine. A significant investment/development effort to handle one bot is not sensible.

NimChimpsky
  • I'd have banned the crawler with a `401`, `403` or `429`. Or maybe we can get a new status code? Something like `437 - Bad Bot` or so... – Bobby Jun 15 '12 at 08:41
  • @Bobby that might be better (less fun though). I am using Tuckey URL rewrite with Tomcat - any pointers on how to do what you suggested? – NimChimpsky Jun 15 '12 at 08:45
  • I've never used that...so sorry. But if you parse the Agent String, it should be possible to simply jump out with the HTTP Status instead of doing the rewrite. – Bobby Jun 15 '12 at 08:47
  • If the bot is useful, you might be better off forcibly rate-limiting it with `iptables`. If it isn't useful, 403 or an outright block with iptables is the best bet. Bots don't follow redirects in the same way that browsers do so giving them 1000 redirects to their own home page only results in the bot requesting that page once. (Or whatever firewall you use since I see you're on Windows.) – Ladadadada Jun 15 '12 at 09:35

1 Answer


A webcrawler has brought our site down twice

If a webcrawler can bring your site down, then it has demonstrated that your site is very vulnerable to DoS. Yes, a quick fix is to block that crawler's access, but that doesn't really provide much protection against other web crawlers, deliberate DoS, or high volumes of legitimate traffic.

I agree with Bobby - where you know that the requests are from a badly behaved client, the right response is a 4xx error code - and you can put any status message in the response, which you should repeat in the body. I don't think it needs a new status code - 429 (Too Many Requests) already addresses the situation.
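Since you mention Tuckey's UrlRewriteFilter on Tomcat, something along these lines in urlrewrite.xml should let you answer with a status code instead of a redirect (untested sketch - the user agent pattern is a placeholder, and it's worth checking the UrlRewriteFilter docs for your version's exact `set`/`to` semantics):

```xml
<!-- Sketch: answer the offending crawler with a 4xx instead of rewriting or redirecting -->
<rule>
    <name>Reject bad bot</name>
    <!-- placeholder pattern; match the crawler's real user agent string here -->
    <condition name="user-agent">.*BadBot.*</condition>
    <from>^/.*$</from>
    <!-- send a 4xx status (429 Too Many Requests is another reasonable choice) -->
    <set type="status">403</set>
    <!-- "null" means no rewrite is performed; the rule only sets the status -->
    <to>null</to>
</rule>
```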

Really you should be looking at how to handle such traffic more gracefully - a minimum bandwidth guarantee is more effective than bandwidth capping, but also rarer. Limiting the number of connections and the connection rate per IP address is a good approach too (but beware of IPv6 PoP issues if you're on IPv4).
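On the Tomcat side you can at least cap overall concurrency in server.xml so that a flood degrades gracefully rather than taking the whole site down - note this is a global ceiling, not a per-IP limit, and the numbers below are placeholders to tune for your hardware:

```xml
<!-- server.xml sketch: bound concurrent request processing and the accept queue -->
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="150"
           acceptCount="100"
           connectionTimeout="20000" />
```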

If you want an adaptive solution running in userspace (assuming this is on Linux / BSD), have a look at fail2ban.

Restricting bandwidth / connections is still remediation, though - a better solution is to improve the performance and capacity of your system.

symcbean
  • The bot uses grid computing - lots of different IPs. How can I return an error code with Tomcat/Tuckey instead of redirecting? It's not Linux, it's Windows. – NimChimpsky Jun 15 '12 at 09:20
  • "Really you should be looking at how to handle such traffic more gracefully" That's a business decision not a technical issue. – NimChimpsky Jun 15 '12 at 09:45