3

Our websites are regularly crawled by content thieves. We obviously want to let the nice bots and legitimate user activity through, but block the questionable activity.

We have tried IP blocking at our firewall, but the block lists quickly become difficult to manage. We have also used IIS handlers, but that complicates our web applications.

Is anyone familiar with network appliances, firewalls or application services (say, for IIS) that can reduce or eliminate content scrapers?

John Gardeniers
drodecker

3 Answers

2

If the scrapers are bots and not humans, you could try creating a honeypot directory that they would crawl into and be blocked (by IP address) automatically via a "default page" script in that directory. Humans could easily unblock themselves, but bots would be thwarted, getting a 403 "Forbidden" error on any further access. I use a technique like this to block bad robots that disobey robots.txt, without permanently blocking humans who either share the same IP or "accidentally" navigate to the blocking script. That way, if a shared IP gets blocked, the block isn't permanent. Here's how:

I set up a default (scripted) page in one or more subdirectories (folders) that are blocked in robots.txt. That page, if loaded by a misbehaving robot -- or a snooping human -- adds their IP address to a blocked list. But I have a 403 ("Forbidden") error handler that redirects these blocked IPs to a page explaining what's going on and containing a captcha that a human can use to unblock the IP. That way, if an IP is blocked because one person used it one time for a bad purpose, the next person to get that IP won't be permanently blocked -- just inconvenienced a little. Of course, if a particular IP keeps getting re-blocked a lot, I can take further steps manually to address that.

Here is the logic:

  1. If IP not blocked, allow access normally.
  2. If visitor navigates to forbidden area, block their IP.
  3. If IP is blocked, redirect all access to the "unblock" form containing the captcha.
  4. If user manually enters proper captcha, remove the IP from the blocked list (and log that fact).
  5. Rinse, lather, repeat the above steps for further accesses.

That's it! One script file to handle the block notice and the unblock captcha submission. One entry (minimum) in the robots.txt file. One 403 redirection in the .htaccess file.
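
If it helps to picture it, here is a minimal sketch of that flow. I've used Python/Flask purely as an illustration (my real setup is a scripted default page plus the .htaccess 403 redirect, not a Python app); the trap path, the in-memory set and the captcha check are all placeholders:

    from flask import Flask, request, redirect, render_template_string

    app = Flask(__name__)
    blocked_ips = set()               # in practice this would be a persistent store
    TRAP_PREFIX = "/private-archive"  # also listed as Disallow: in robots.txt

    @app.before_request
    def honeypot_and_blocklist():
        ip = request.remote_addr
        # Step 2: anyone who requests the forbidden area gets their IP blocked.
        if request.path.startswith(TRAP_PREFIX):
            blocked_ips.add(ip)
        # Step 3: blocked IPs are sent to the unblock form for everything else.
        if ip in blocked_ips and request.path != "/unblock":
            return redirect("/unblock")
        # Step 1: unblocked IPs fall through and the site is served normally.

    @app.route("/unblock", methods=["GET", "POST"])
    def unblock():
        ip = request.remote_addr
        # Step 4: a correct captcha answer removes the IP from the blocked list.
        if request.method == "POST" and captcha_is_valid(request.form.get("captcha", "")):
            blocked_ips.discard(ip)
            app.logger.info("IP %s unblocked via captcha", ip)
            return redirect("/")
        return render_template_string(
            "<p>This IP was blocked for requesting a restricted area.</p>"
            "<form method='post'>Captcha: <input name='captcha'>"
            "<button type='submit'>Unblock me</button></form>"), 403

    def captcha_is_valid(answer):
        # Placeholder only; wire up a real captcha service here.
        return answer.strip().lower() == "expected-answer"

The only state is the set of blocked IPs, so swapping the set for a file or small database keeps blocks across restarts and lets you spot IPs that keep getting re-blocked.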

Rob W
0

Have you checked the request headers? Depending on whether you're dealing with script kiddies or something more determined, that may be enough.
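
For example (just a sketch, in Python/WSGI so it's framework-agnostic; the User-Agent list is made up), something like this would stop the laziest scrapers, though anyone serious will simply forge the headers:

    SUSPICIOUS_AGENTS = ("curl", "wget", "python-requests", "scrapy")

    class HeaderFilter:
        """WSGI middleware that rejects requests with missing or known-bad User-Agents."""
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            agent = environ.get("HTTP_USER_AGENT", "").lower()
            # No User-Agent at all, or one from a common scraping tool: answer with 403.
            if not agent or any(tool in agent for tool in SUSPICIOUS_AGENTS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return self.app(environ, start_response)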

Eddy
  • I agree, the request headers may need to be evaluated. Rather than having to program this logic, I'm wondering if there are any product solutions available. – drodecker Mar 03 '10 at 07:08
  • You can do this on Apache itself with mod_security: http://www.modsecurity.org/ – Eddy Mar 04 '10 at 11:13
0

You want a hardware firewall that does HTTP inspection. This won't come cheap, I'm afraid.

I seem to recall that a Cisco ASA 5520 will do this, but the list price for one of these is about £4600 ~= $6900.

You could probably do something similar with a Linux box running a firewall application, for a fraction of the cost.

Tom O'Connor
  • Sounds like it may be a suitable solution. What I'm really looking for, though, is a service with the data to tell good bots from bad ones. Are you familiar with a good Linux routing solution? – drodecker Mar 04 '10 at 08:17