
I've written a short program that uses a web browser control to visit many pages of a website (on which users can sell/buy items - kind of like a specialist eBay). However, after 20 or so page views I get redirected to a flow-control page: "you've made too many requests recently".

I don't actually have any issue with spreading out my requests; this is for a pet project, nothing commercial. However, I figured I'd try to bypass it so I can be sure I don't miss pages (my plan is to do exactly 20 pages every hour, which at the moment seems to be below the boundary). At first I thought it might just be cookies, so I deleted them after each request. Then I thought it might be something to do with the last page I was on (i.e. I move to page 2 and it knows I've come from page 1, etc.), so I sent the web browser control to about:blank in between requests (no success here either).

So how does a website track users for flow-control purposes? It can't be IP-based, because I can still access the site using a different browser (Chrome, as opposed to an embedded IE). For reference, a rough sketch of my plan is below.
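This is only a minimal sketch of the pacing idea in Python with the `requests` library (rather than the embedded IE control), and the URL pattern is made up:

```python
import time
import requests

# Hypothetical listing URL pattern; the real site's URLs will differ.
BASE = "https://example-marketplace.com/listings?page={}"

for page in range(1, 21):
    # A fresh session per request drops all cookies, mirroring the
    # "delete cookies after each request" experiment described above.
    with requests.Session() as session:
        resp = session.get(BASE.format(page))
        print(page, resp.status_code)
    time.sleep(3600 / 20)  # exactly 20 pages per hour
```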

FraserOfSmeg
  • Maybe the site is tracking what browser agents contact it? – xorist Mar 11 '16 at 20:56
  • Not really a duplicate, but possible answers might be similar to answers for [How to uniquely identify users with the same external IP address?](https://security.stackexchange.com/questions/81302/how-to-uniquely-identify-users-with-the-same-external-ip-address) – tim Mar 11 '16 at 21:06

1 Answer

There are many ways to enforce request rate limits, so I'll mostly focus here on what appears to be going on in your case.

It's highly unlikely, considering it's a commerce site, that you're dealing with such an anal-retentive IDS (Intrusion Detection System) set up to block anything that remotely looks like a crawler and to block (redirect, in your case) requests at the network level. Even though such systems are capable of request rate limiting, on sites like you describe they would be set up a lot more permissively and would only try to prevent illegal requests, request flooding, and the like. 20 GET requests per hour isn't that.

What's more likely going on is that you've hit a pesky WAF (Web Application Firewall), which acts at the application level. A WAF will fingerprint your browser and compare consecutive requests to establish whether you're a crawler, then flag your requests as suspicious if they follow a certain pattern that it is set up (or even trained, if you're dealing with a heuristic WAF) to detect. One such pattern: requesting many subsequent pages in the same order as they appear in the source, but missing that source page (or the domain alone, in the case of HTTPS) in the GET request header's referer [sic, it's misspelled in the HTTP specs] field. This would happen if you scripted your browser to request each new page independently, as if you typed its URL in the address bar and pressed enter. It's one of the safest methods (low chance of false positives) of detecting automated crawlers (aka bots), especially if you repeatedly request long and messy URLs while the site otherwise uses human-readable versions of URLs, or ones equipped with additional URI parameters (referrer schemes, user tracking, ...) in emails and everywhere else where users might access many of them in a short period of time without opening them through some landing page.
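To illustrate the difference, here's a minimal sketch (Python with the `requests` library; the site and item URLs are hypothetical) of requests that carry the listing page in the `Referer` header, the way normal click-through navigation would:

```python
import time
import requests

BASE = "https://example-marketplace.com"  # hypothetical site

session = requests.Session()

# Fetch the listing page first, as a human following links would.
listing_url = f"{BASE}/listings?page=1"
listing = session.get(listing_url)
listing.raise_for_status()

# Hypothetical item URLs that would be scraped from the listing page.
item_urls = [f"{BASE}/item/{n}" for n in (101, 102, 103)]

for url in item_urls:
    # Sending the page we "clicked from" as the Referer makes the
    # traffic look like ordinary navigation; omitting it on every
    # request is exactly the bot fingerprint described above.
    resp = session.get(url, headers={"Referer": listing_url})
    print(resp.status_code, url)
    time.sleep(180)  # keep the pace well under the rate limit
```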

In short, if you want to avoid this kind of automated-crawler detection, you'll have to either inspect and obey the rules in the site's robots.txt, ask the site admin to relax the rules for your crawler and make sure they can easily identify it (such as through the user agent string), or make your script mimic a human user better. The previous paragraph should help you avoid one of the most common bot detection techniques, which seems to apply in your case, but you might hit additional ones as your request rate increases.
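As a starting point for the first two options, here's a sketch (again Python; the crawler name, contact address, and URL are assumptions) of checking robots.txt with the standard-library parser and identifying your bot via the User-Agent string:

```python
import urllib.robotparser
import requests

BASE = "https://example-marketplace.com"  # hypothetical site
# Assumed crawler name; pick something the admin can recognize.
USER_AGENT = "FrasersPetCrawler/0.1 (contact: you@example.com)"

# Parse the site's robots.txt rules.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{BASE}/robots.txt")
rp.read()

url = f"{BASE}/listings?page=1"  # hypothetical page
if rp.can_fetch(USER_AGENT, url):
    # Identify the crawler honestly so the admin can whitelist it.
    resp = requests.get(url, headers={"User-Agent": USER_AGENT})
    print(resp.status_code)
else:
    print("robots.txt disallows this URL for our crawler")
```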

TildalWave
  • @TildalWave fancy seeing you here! Thanks for the answer. I've just received a shed load of work emails, but when I get a chance to play around with implementing some changes I'll let you know how it goes! :D – FraserOfSmeg Mar 11 '16 at 23:53