For example, if I am on stackoverflow and I refresh my page several times in a row, it starts to think I am a bot and blocks me.
How can I build something like this into my own site?
Use Roboo: http://www.ecl-labs.org/2011/03/17/roboo-http-mitigator.html
Demoed at Black Hat in 2011. It's very effective and easy to get up and running; I would recommend it over any use of CAPTCHA.
"Roboo uses advanced non-interactive HTTP challenge/response mechanisms to detect and subsequently mitigate HTTP robots, by verifying the existence of HTTP, HTML, DOM, Javascript and Flash stacks at the client side.
Such deep level of verification weeds out the larger percentage of HTTP robots which do not use real browsers or implement full browser stacks, resulting in the mitigation of various web threats:
It's also a good measure against DDoS: http://www.rakkhis.com/2011/03/ddos-protection-strategies.html
As with everything lately, you can roll your own solution or use software-as-a-service. There are a few web services that do this sort of thing, if you are comfortable with a possible compromise of your visitors' privacy.
The most notable is perhaps Cloudflare, which can be configured for basic (free) or advanced (paid) protection. It's a very popular startup lately, with lots of sites using it.
It operates as a transparent proxy in front of your website and checks incoming requests against various criteria, using extensive crowdsourced data about where a request comes from and whether that IP has been seen doing malicious things before. Sources of data include open databases like https://www.projecthoneypot.org. Suspicious users are then presented with a challenge page asking them to verify themselves, perhaps by completing a CAPTCHA. It categorises 'threats' as botnet zombies, mass spammers, etc. The paid version also operates as a WAF, trying to catch SQL/XSS injection attempts.
First, you need to assess which parts of your application are at risk of being automated to gain some sort of advantage. For instance, what is to be gained by reloading that Stack Overflow page multiple times? Perhaps a bot written by a user to boost the view counts on their own questions/answers?
Next is determining what behavior constitutes a possible bot. For your Stack Overflow example, it might be a certain number of page loads in a small time window from a single user (identified not just by IP, but perhaps by user agent, source port, etc.).
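As a very rough illustration of such a rule (everything here is hypothetical, not taken from any particular library), you could key your tracking on a fingerprint of IP, user agent, and source port, and express the rule as "more than N loads of one page within T seconds":

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClientFingerprint:
    """Identifies a 'unique user' more precisely than IP alone."""
    ip: str
    user_agent: str
    source_port: int

@dataclass
class RateRule:
    """Flag a client that loads the same page more than
    max_hits times within window_seconds."""
    max_hits: int = 20        # hypothetical threshold
    window_seconds: int = 60  # hypothetical window

# Example: flag anyone hitting one page more than 20 times in a minute.
rule = RateRule(max_hits=20, window_seconds=60)
```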
Next, you build the engine that contains these rules, collects tracking data, monitors each request, analyzes it against the criteria, and flags clients as bots. I would think you would want this engine to run against the web logs rather than against live requests for performance reasons, but you could load test that.
I would imagine the system would work like this (using your Stack Overflow example): the engine reads a log entry for a web hit and adds it to its database, aggregating that hit with all other hits by that unique user on that unique page. Two timestamps get recorded per series, that of the first hit and that of the most recent, and the series' total hit count is incremented.
Then query that data, subtracting the time of the first hit from the time of the last, for all series whose hit count is over your threshold; if that elapsed time is short enough, the check fails. Unique users who fail the check are flagged, and on the front end you simply check each hit against that list of flagged users and act accordingly. Granted, my algorithm is flawed as I just thought it up on the spot.
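A minimal sketch of that idea in Python (all names, the thresholds, and the log-reading details are assumptions on my part, not a real tool): aggregate hits per (user, page) series with first/last timestamps and a count, then flag any series that racks up too many hits in too short a span.

```python
from collections import defaultdict

MAX_HITS = 30           # hypothetical threshold for one (user, page) series
MIN_SPAN_SECONDS = 60   # hits concentrated in less than this look automated

# series key: (unique_user, page) -> [first_ts, last_ts, hit_count]
series = defaultdict(lambda: [None, None, 0])
flagged_users = set()

def record_hit(user, page, timestamp):
    """Aggregate one web-log entry into its (user, page) series."""
    entry = series[(user, page)]
    if entry[0] is None:
        entry[0] = timestamp    # first hit in the series
    entry[1] = timestamp        # most recent hit
    entry[2] += 1               # total hit count

def flag_bots():
    """Flag users whose series exceed the threshold too quickly."""
    for (user, page), (first, last, count) in series.items():
        if count > MAX_HITS and (last - first) < MIN_SPAN_SECONDS:
            flagged_users.add(user)

def is_flagged(user):
    """Front-end check: consult the flag list on each request."""
    return user in flagged_users
```

On the front end, each request would then call something like is_flagged() and serve a CAPTCHA or block page when it returns True.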
If you google around, you will find that there is lots of free code in different languages that has this functionality. The trick is thinking up the right rules to flag bot behavior.
It'd be fairly simple to have a service that watches connection attempts and, whenever multiple attempts happen in rapid succession from the same IP address, adds that address to a 'black list' of blocked addresses. (Or, if you want to get fancy, it launches some application/query to have the person 'verify' they aren't a bot.)
Off the top of my head (I've never actually tried this), I'd say you could keep a fixed-size dictionary mapping each IP address to the time of its last connection attempt and block it when the times are too close together. Alternatively, the dictionary periodically 'clears' itself of entries older than some amount of time x and counts each address's connections within that time frame; if an address makes too many connections in the x time frame, it gets blocked.
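Something like this toy in-memory sketch of the second variant (limits and names are made up, and a real version would need eviction of stale IPs): keep recent attempt timestamps per IP, drop anything older than the window, and block once the count inside the window gets too high.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # the 'x' time frame; purely illustrative
MAX_ATTEMPTS = 5      # allowed attempts within the window; also illustrative

attempts = defaultdict(deque)   # ip -> timestamps of recent attempts
blacklist = set()

def register_attempt(ip, now=None):
    """Record a connection attempt; return True if the IP is still allowed."""
    now = time.time() if now is None else now
    if ip in blacklist:
        return False
    window = attempts[ip]
    window.append(now)
    # 'Clear' entries older than the time frame, as described above.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_ATTEMPTS:
        blacklist.add(ip)   # or, fancier: trigger a 'prove you're human' step
        return False
    return True
```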
A vague answer, I know, but I hope it helps!