For example, if I am on stackoverflow and I refresh my page several times in a row, it starts to think I am a bot and blocks me.

How can I build something like this into my own site?

JD Isaacks

4 Answers

Use Roboo: http://www.ecl-labs.org/2011/03/17/roboo-http-mitigator.html

Demoed at Black Hat in 2011. Very effective and easy to get up and running. I would recommend it over any use of CAPTCHA.

"Roboo uses advanced non-interactive HTTP challenge/response mechanisms to detect and subsequently mitigate HTTP robots, by verifying the existence of HTTP, HTML, DOM, Javascript and Flash stacks at the client side.

Such deep level of verification weeds out the larger percentage of HTTP robots which do not use real browsers or implement full browser stacks, resulting in the mitigation of various web threats:

  • HTTP Denial of Service tools - e.g. Low Orbit Ion Cannon
  • Vulnerability Scanning - e.g. Acunetix Web Vulnerability Scanner, Metasploit Pro, Nessus Web exploits
  • Automatic comment posters/comment spam as a replacement of conventional CAPTCHA methods
  • Spiders, Crawlers and other robotic evil"

Also a good measure against DDoS: http://www.rakkhis.com/2011/03/ddos-protection-strategies.html

Rakkhi
  • How does Roboo compare to other solutions, why pick Roboo? – this.josh Jun 25 '11 at 05:08
  • @this-josh Demoed at Black Hat, seemed like some smart guys. I tried it out; it's easy to set up and seems to work. I haven't done a formal evaluation against other tools, but you can if you want. – Rakkhi Jun 25 '11 at 07:18
  • Please elaborate the response to include more info than just a link. Some of us are paranoid about hopping onto a third party page without any background about why we're doing it ;) – Ori Jun 26 '11 at 09:17
  • @ori good point. Done. Although re the link, it's me posting so you have my SE rep. Also you can see the full path, which you can paste into your browser. Running Chrome and NoScript? – Rakkhi Jun 26 '11 at 12:39
  • @Rakkhi you said you tried it out and it was easy to set up. I didn't notice any instructions on the site. Can you add some steps to get it running? Thanks. – JD Isaacks Jun 27 '11 at 13:42
  • @john-isaacks I'm not a Unix expert, but I followed along with the readme file and got it up and running on Nginx pretty easily. It requires the following Perl modules to be installed: Crypt::Random, Math::Pari, and Net::IP::Match::Regexp. To install, copy Roboo.pm and configure Nginx as per the provided example configuration file and instructions. – Rakkhi Jun 27 '11 at 13:52

As with everything lately, you can roll your own solution or use software-as-a-service. There are a few web services that do this sort of thing, if you are comfortable with a possible compromise of your visitors' privacy.

Perhaps the most notable is Cloudflare, which can be configured for basic (free) or advanced (paid) protection. It's a very popular startup lately, with lots of sites using it.

It operates as a transparent proxy in front of your website and checks incoming requests against various criteria, using extensive crowdsourced data about where the traffic comes from and whether that IP has been seen doing malicious things before. Sources of data include open databases like https://www.projecthoneypot.org. Suspicious visitors are then presented with a challenge page asking them to verify themselves, perhaps by completing a CAPTCHA. It categorises 'threats' as botnet zombies, mass spammers, and so on. In the paid version it also operates as a WAF (web application firewall), trying to catch SQL injection and XSS attempts.
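
If you want to experiment with the underlying idea yourself, here is a minimal Python sketch (not Cloudflare's actual implementation): a WSGI middleware that checks the client IP against a local reputation blocklist, standing in for a feed such as Project Honeypot's, and serves a challenge page instead of the real site. All names and addresses are made up for the example.

    # Minimal sketch of a reputation-based challenge, NOT Cloudflare's real logic.
    # The blocklist is a hard-coded set standing in for a reputation feed;
    # `app` and `challenge_middleware` are names invented for this example.
    from wsgiref.simple_server import make_server

    SUSPICIOUS_IPS = {"203.0.113.7", "198.51.100.23"}   # example/reserved addresses

    def app(environ, start_response):
        """The real site being protected."""
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello, human!"]

    def challenge_middleware(wrapped_app):
        def middleware(environ, start_response):
            client_ip = environ.get("REMOTE_ADDR", "")
            if client_ip in SUSPICIOUS_IPS:
                # A real system would serve a CAPTCHA or JavaScript
                # proof-of-browser check here; this is just static text.
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Your IP has a poor reputation. Please complete the challenge."]
            return wrapped_app(environ, start_response)
        return middleware

    if __name__ == "__main__":
        make_server("", 8000, challenge_middleware(app)).serve_forever()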

john

First, you need to assess which parts of your application are at risk of being automated to gain some sort of advantage. For instance, what is to be gained by re-loading that stackoverflow page multiple times? Perhaps a bot run by a user to boost the view counts on their own questions/answers?

The next step is determining what behavior constitutes a possible bot. For your stackoverflow example, it would perhaps be a certain number of page loads in a given small time frame from a single user (identified not just by IP, but perhaps also by user agent, source port, etc.).

Next, you build the engine that contains these rules, collects tracking data, monitors each request to analyze against the criteria, and flags clients as bots. I would think you would want this engine to run against the web logs and not against live requests for performance reasons, but you could load test this.

I would imagine the system would work like this (using your stackoverflow example): the engine reads a log entry for a web hit and adds it to its database of hits, aggregating it with all other hits by that unique user on that unique page. For each such series it records two timestamps, that of the first hit and that of the most recent, and increments the total hit count for the series.

Then query that list for all series whose hit count is over your threshold, subtracting the time of the first hit from the time of the last to get the duration of each series. Unique users which fail the check are flagged. Then on the front end you simply check all hits against that list of flagged users and act accordingly. Granted, my algorithm is flawed, as I just thought it up on the spot.
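
For illustration only, here is a rough Python sketch of that log-driven approach; the log format, thresholds, and names are assumptions made up for this example, not a reference implementation.

    # Rough sketch of the series-aggregation idea described above.
    # Assumes log entries of (timestamp, user_key, page); thresholds and
    # names are illustrative, not a drop-in solution.
    from collections import defaultdict

    HIT_THRESHOLD = 20        # more hits than this ...
    WINDOW_SECONDS = 60       # ... within this many seconds looks bot-like

    def aggregate(log_entries):
        """Group hits into series keyed by (user, page), tracking first/last time and count."""
        series = defaultdict(lambda: {"first": None, "last": None, "count": 0})
        for timestamp, user_key, page in log_entries:
            s = series[(user_key, page)]
            if s["first"] is None:
                s["first"] = timestamp
            s["last"] = timestamp
            s["count"] += 1
        return series

    def flag_bots(series):
        """Flag users whose hit count exceeds the threshold within the time window."""
        flagged = set()
        for (user_key, _page), s in series.items():
            if s["count"] > HIT_THRESHOLD and (s["last"] - s["first"]) <= WINDOW_SECONDS:
                flagged.add(user_key)
        return flagged

    # Example: one user hammering the same page once a second for 30 seconds.
    log = [(t, "10.0.0.5|Mozilla/5.0", "/questions/42") for t in range(30)]
    print(flag_bots(aggregate(log)))   # -> {'10.0.0.5|Mozilla/5.0'}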

If you Google around, you will find lots of free code in different languages that has this functionality. The trick is thinking up the right rules to flag bot behavior.

queso
  • This seems like you're talking more about some form of application-level IDS/IPS, not simple bot-blocking. While it's a nice idea, it's much more complex, and IMO it's far out of scope of the original question. – AviD Jun 26 '11 at 08:48
  • @AviD, I don't see anything in the original question about *scope* - it's asking how to implement something similar to SO, which appears to use more than just filtering out known bot-names in the header. – John C Jun 26 '11 at 14:15
  • Agree with John C. There is nothing simple about truly effective bot-blocking, hence my complicated answer. Plus, OP explicitly stated the word "build" so I didn't mention pre-built stuff like Roboo, etc. – queso Jun 27 '11 at 22:31
  • May be much more complex.. but definitely a BETTER solution. – Jun 28 '11 at 13:27

It'd be fairly simple to have a service that looks for connection attempts, and whenever multiple attempts happen in rapid succession from the same IP address, it adds that address to a 'black list' of blocked addresses. (Or, if you want to get fancy, it launches some application/query to have the person 'verify' they aren't a bot.)

Off the top of my head (I've never actually tried this), I'd say you could keep a fixed-size dictionary keyed by IP address, with each entry storing the time of that address's last connection attempt; when successive attempts from the same address come too close together, block it. Alternatively, the dictionary could periodically 'clear' itself of entries older than some amount of time x, and you increment a counter for each address on every connection within that time frame; if an address racks up too many connections in the window, it is blocked.
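
A minimal Python sketch of that second variant, assuming an in-memory dictionary that forgets timestamps older than the window; the window size, limit, and names are arbitrary values chosen for illustration:

    # Sketch of a per-IP sliding-window counter: timestamps older than the
    # window are dropped, and an IP is blocked once it exceeds the limit.
    # Window size and limit are arbitrary example values.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_HITS_PER_WINDOW = 5

    hits = defaultdict(deque)     # ip -> deque of recent request timestamps
    blacklist = set()

    def is_blocked(ip, now=None):
        """Record a hit for `ip` and return True if it should be blocked."""
        now = time.time() if now is None else now
        if ip in blacklist:
            return True
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:   # forget hits outside the window
            q.popleft()
        if len(q) > MAX_HITS_PER_WINDOW:
            blacklist.add(ip)
            return True
        return False

    # Example: the sixth rapid request from the same IP gets blocked.
    for i in range(7):
        print(i, is_blocked("192.0.2.1", now=1000.0 + i * 0.5))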

A vague answer, I know, but I hope it helps!

  • IP address is low hanging fruit, but in practice is ineffective for this. Most large networks use NAT, so multiple hosts share a few externally facing IP addresses. My university proxied all connections from their internal net through just 2 external IP addresses. In OP's example, Stackoverflow could see hundreds of requests per second from those 2 IP addresses, but it is all valid traffic from dozens of real humans whose access is legit. That's why I discussed user agent string, cookie, etc. in my answer as other uniquely identifying characteristics. – queso Jun 27 '11 at 22:29
  • @queso Of course... there are more in-depth ways of doing it. That was just the first thing off the top of my head, a basic example. Agent strings would be more precise. On a side note: what university did you go to that only had two external addresses?! Mine uses a class B, and has almost filled it! heh. Though, they are also an ISP for some surrounding businesses/schools I believe. – Jun 28 '11 at 13:25
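
To illustrate the point raised in the comments above (an IP address alone is too coarse behind NAT or a proxy), the rate-limiting key could combine several request attributes rather than just the IP; the attribute choices below are purely illustrative:

    # Illustrative only: build a rate-limiting key from more than the IP,
    # so users behind the same NAT/proxy are not all lumped together.
    import hashlib

    def client_key(ip, user_agent, session_cookie=""):
        """Hash IP + User-Agent + session cookie into one identifier."""
        raw = "|".join([ip, user_agent, session_cookie])
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    print(client_key("192.0.2.1", "Mozilla/5.0 (X11; Linux x86_64)", "sid=abc123"))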