If the scrapers are BOTS and not humans, you could try creating a honeypot directory that they would crawl into and be blocked (by IP address) automatically via a "default page" script in that directory. Humans could easily unblock themselves, but it would thwart bots, as they would get a 403 "Forbidden" error on any further access. I use a technique like this to block bad robots that disobey robots.txt without permanently blocking humans who either share the same IP or "accidentally" navigate to the blocking script. That way, if a shared IP gets blocked, the block isn't permanent. Here's how:
I set up a default (scripted) page in one or more subdirectories (folders) that are disallowed in robots.txt. That page, if loaded by a misbehaving robot -- or a snooping human -- adds the visitor's IP address to a blocked list. But I also have a 403 ("Forbidden") error handler that redirects these blocked IPs to a page explaining what's going on and containing a captcha that a human can use to unblock the IP. That way, if an IP is blocked because one person used it once for a bad purpose, the next person to get that IP won't be permanently blocked -- just inconvenienced a little. Of course, if a particular IP keeps getting RE-blocked a lot, I can take further steps manually to address that.
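For example, the honeypot entry in robots.txt might look like this (the folder name /private-stuff/ is just a placeholder -- use anything that isn't linked anywhere a legitimate visitor would go):

```
User-agent: *
Disallow: /private-stuff/
```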
Here is the logic (a rough code sketch follows the list):
- If IP not blocked, allow access normally.
- If visitor navigates to forbidden area, block their IP.
- If IP is blocked, redirect all access to the "unblock" form containing the captcha.
- If user manually enters proper captcha, remove the IP from the blocked list (and log that fact).
- Rinse, lather, REPEAT the above steps for further accesses.
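Here's a minimal sketch of that logic, written in Python/Flask purely for illustration (my real setup is just a server-side script plus htaccess). The blocked_ips.txt file, the /private-stuff/ honeypot path, the /unblock URL, and the trivial arithmetic question standing in for a real captcha are all placeholders:

```
from pathlib import Path

from flask import Flask, redirect, request, url_for

app = Flask(__name__)
BLOCKLIST = Path("blocked_ips.txt")   # flat-file block list (assumed)
HONEYPOT_PATH = "/private-stuff/"     # same folder you disallow in robots.txt


def blocked_ips() -> set:
    return set(BLOCKLIST.read_text().split()) if BLOCKLIST.exists() else set()


def save_ips(ips) -> None:
    BLOCKLIST.write_text("\n".join(sorted(ips)))


@app.before_request
def enforce_block_list():
    # Blocked IPs can reach only the unblock form; everything else redirects there.
    if request.remote_addr in blocked_ips() and request.path != "/unblock":
        return redirect(url_for("unblock_form"))


@app.route(HONEYPOT_PATH)
def honeypot():
    # Anything that ignores robots.txt and lands here gets its IP blocked.
    ips = blocked_ips()
    ips.add(request.remote_addr)
    save_ips(ips)
    return "Forbidden", 403


@app.route("/unblock", methods=["GET", "POST"])
def unblock_form():
    # Placeholder challenge; a real deployment would use an actual captcha.
    if request.method == "POST" and request.form.get("answer", "").strip() == "7":
        ips = blocked_ips()
        ips.discard(request.remote_addr)
        save_ips(ips)
        app.logger.info("Unblocked %s via captcha", request.remote_addr)  # log the unblock
        return "Your IP has been unblocked."
    return (
        "<p>Your IP address was blocked after a request to a restricted area.</p>"
        '<form method="post">What is 3 + 4? <input name="answer">'
        "<button>Unblock me</button></form>"
    )


if __name__ == "__main__":
    app.run()
```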
That's it! One script file to handle the block notice and unblock captcha submission. One entry (minimum) in the robots.txt file. One 403 redirection in the htaccess file.
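For the htaccess side, something like this (Apache assumed; the script name, the example IP, and keeping the deny list directly in .htaccess are just one possible arrangement):

```
# Serve the unblock/captcha script to anyone who triggers a 403
ErrorDocument 403 /unblock.php

# One way to hold the block list: the honeypot script appends a
# "Require not ip ..." line here for each blocked address (Apache 2.4 syntax)
<RequireAll>
    Require all granted
    Require not ip 203.0.113.7
</RequireAll>

# Make sure blocked visitors can still load the unblock script itself
<Files "unblock.php">
    Require all granted
</Files>
```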