How do large sites (e.g. Wikipedia) deal with bots that sit behind an IP masker? For instance, at my university, everybody searches Wikipedia, putting a significant load on it. But, as far as I know, Wikipedia can only see the IP of the university router, so if I set up an "unleashed" bot (with only a small delay between requests), can Wikipedia ban my bot without banning the whole organization? Can a site actually ban an IP behind an organizational network?
-
[The day Wikipedia banned Qatar](https://en.wikinews.org/wiki/Qatari_proxy_IP_address_temporarily_blocked_on_Wikipedia). – isanae Apr 19 '16 at 01:51
-
@isanae Related: http://superuser.com/q/1013630/326546 – kasperd Apr 19 '16 at 08:29
-
Better make your bot [indistinguishable from legitimate users](https://xkcd.com/810/) – Hagen von Eitzen Apr 19 '16 at 17:10
3 Answers
No, they'll ban the public IP and everyone who is NAT'd to that IP will also be banned.
Although, at least at Stack, if we think we are going to ban a college or something like that, we'll reach out to their abuse contact to get them to track the offender down and stop the issue.
-
What Zypher said. Speaking as someone who used to track down complaints sent to abuse@unnamedacademicinstitution.edu, we were usually pretty eager to find the person responsible so they would unblock the public IP. (College students *love* to share music peer to peer. RIAA loves to contact abuse@whatever.edu about it.) – Katherine Villyard Apr 18 '16 at 18:18
-
...unless there is something uniquely identifiable about your bot, such as passing an access token or a unique browser id. – simpleuser Apr 18 '16 at 22:13
-
This doesn't answer the actual title question of how these sites *detect* bots. In fact, it seems that if you slow down your bot sufficiently (which wouldn't be much), it would in fact be indistinguishable from valid usage by a whole bunch of college students. – Wildcard Apr 18 '16 at 23:47
-
To extend on @KatherineVillyard's comment: having formerly overseen the network of an institution, I can say that if no one reached out to us prior to blocking, and the resource we were blocked from was regularly used, we would reach out to them to correct the problem. Usually they were willing to unblock us if we would resolve this from our end. This meant pursuing the source of abuse. Being Wikipedia, even if they don't reach out to your institution, your institution will likely look into it once they realize they have been blacklisted. That seemingly harmless ban can quickly turn into an expulsion. – Bacon Brad Apr 19 '16 at 01:04
-
@Wildcard FWIW most places won't tell you how they detect bots simply because that will just get the bot authors they are catching to change things up. That said, there are many other signals besides velocity of requests to detect bots. But most places won't care that much if you are playing nice, not doing something shitty or straining resources. It just isn't worth it to chase every small bot out there. – Zypher Apr 19 '16 at 16:36
A site cannot directly ban an IP which is behind NAT. It could act on IPs passed through non-anonymising HTTP proxies: when such a proxy forwards a request, it typically appends the client's address to an X-Forwarded-For header, so if access from your private network actually has to go via such a proxy, the internal IP could be exposed. However, most sites (Wikipedia included) wouldn't trust the information in that header anyway, because it's easy to spoof in order to implicate innocent IPs or evade bans.
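To illustrate why that header is only useful when the site already trusts the proxy, here is a minimal sketch of how a server might pick the address to key a ban on. The `TRUSTED_PROXIES` range and the function names are made up for the example; this is not how Wikipedia or any particular site does it.

```python
import ipaddress
from typing import Optional

# Hypothetical: address ranges of proxies this site has decided to trust.
TRUSTED_PROXIES = [ipaddress.ip_network("192.0.2.0/24")]


def is_trusted_proxy(addr: str) -> bool:
    """True if addr falls inside one of the trusted proxy ranges."""
    ip = ipaddress.ip_address(addr)  # raises ValueError on malformed input
    return any(ip in net for net in TRUSTED_PROXIES)


def client_address(peer_addr: str, x_forwarded_for: Optional[str]) -> str:
    """Return the address a ban should be keyed on.

    peer_addr is the address of the TCP peer; x_forwarded_for is the raw
    header value, or None if the header was absent.
    """
    # Anyone can send this header, so ignore it unless the direct peer
    # is a proxy we trust to have filled it in honestly.
    if not x_forwarded_for or not is_trusted_proxy(peer_addr):
        return peer_addr
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    # Proxies append to the right, so walk right to left and stop at the
    # first hop that is not itself a trusted proxy.
    for hop in reversed(hops):
        try:
            if not is_trusted_proxy(hop):
                return hop
        except ValueError:
            return peer_addr  # malformed entry: fall back to the peer
    return peer_addr
```

With an untrusted peer the header is ignored entirely, which is the default position most sites end up in.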
There are, however, other techniques that attempt to uniquely identify users independently of IP address. A site can interrogate a web browser for a lot of information about it and the system it's running on, such as the user-agent, screen resolution, list of plugins, etc. - see https://github.com/carlo/jquery-browser-fingerprint for an example of this in practice. A site could use such fingerprints to control access, though depending on the site design you may be able to interact with it without engaging with the fingerprinting process, and even if you can't, a bot that is aware this kind of protection is in place could provide spurious and randomised data in order to avoid having a consistent fingerprint. This method of control also runs the risk of false positives, especially when it comes to mobile devices, where there will probably be large numbers of users running identical stock clients on identical stock hardware (most people on a specific model of iPhone running a specific version of iOS, for instance, would probably get the same fingerprint). Fingerprinting like this is normally just used for user tracking rather than to enforce controls, but I am aware of places which do use fingerprinting to implement bans when there is concern that an IP block would be too broad, and it could be effective against a naive bot.
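As a rough sketch of the idea (not the linked plugin's actual algorithm), a fingerprint can be as simple as a hash over whatever attributes the browser reports. The attribute names below are illustrative assumptions.

```python
import hashlib
import json


def fingerprint(attributes: dict) -> str:
    """Hash a set of reported browser attributes into a stable identifier."""
    # Serialise with sorted keys so the same attributes always hash the same,
    # regardless of the order in which they were collected.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two different people on identical stock phones: same fingerprint,
# so banning one would ban the other (the false-positive risk).
phone_a = {"user_agent": "ExamplePhone/1.0", "screen": "750x1334", "plugins": []}
phone_b = {"plugins": [], "screen": "750x1334", "user_agent": "ExamplePhone/1.0"}

# A bot that randomises one reported value gets a fresh fingerprint each time.
bot = {"user_agent": "ExamplePhone/1.0", "screen": "751x1334", "plugins": []}

same_device_collision = fingerprint(phone_a) == fingerprint(phone_b)
bot_evades = fingerprint(bot) != fingerprint(phone_a)
```

Both weaknesses mentioned above show up directly: identical devices collide, and changing any single reported value is enough to evade a ban keyed on the hash.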
-
It's not unlikely at all; many universities, and at least one entire country, proxy web connections and add X-Forwarded-For. – Michael Hampton Apr 19 '16 at 00:18
-
Interesting. I would personally be surprised if a company were to configure their web proxies to do that as it exposes some (admittedly trivial) information about your internal network, but I guess it depends on the org. – Carcer Apr 19 '16 at 08:09
-
@Carcer, it does not have to be the real internal IP address, just something that is consistent for each user of the proxy. – Ian Ringrose Apr 19 '16 at 12:09
Generally, the IP address alone isn't sufficient information for a correct ban, so advanced networks work higher up the network stack.
A Denial of Service (DoS) attack (which you are worried about creating) is usually handled by rate limiting the initial TCP connection setup. This means legitimate users that are willing to wait will get through, whereas those that are just trying to consume server resources are slowed to the point that the attack becomes harmless. This defence is what pushed attackers from DoS to Distributed DoS (DDoS) attacks.
Once you have a connection to the server you can make as many requests as you like; the web server administrators can configure how many requests to handle.
The web server can probably handle more capacity than your local network gateway anyway, so the gateway is probably the limiting factor in your use case. I'd wager your university network admins would come knocking on your door before Wikipedia did.
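In practice this kind of limiting is done in the firewall or web server configuration rather than in application code, but the logic is roughly a token bucket per client address. The sketch below is only an illustration of that logic; the class name and parameters are my own, and the `clock` argument exists just so it can be tested without waiting.

```python
import time


class TokenBucketLimiter:
    """Allow up to `burst` requests at once, refilling at `rate` per second."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._state = {}  # key -> (tokens remaining, time of last update)

    def allow(self, key: str) -> bool:
        now = self.clock()
        # A key we have not seen before starts with a full bucket.
        tokens, last = self._state.get(key, (float(self.burst), now))
        # Refill for the time elapsed, capped at the bucket size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[key] = (tokens - 1.0, now)
            return True
        self._state[key] = (tokens, now)
        return False
```

Note that when the key is the public IP, everyone behind the same NAT shares one bucket, so a single greedy bot slows things down for the whole university.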
It is important to be a good Internet citizen so I would add rate limiting code to a bot.
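Something along these lines would do as a starting point: a small throttle that enforces a minimum gap between requests. The class name and the interval are arbitrary choices for the example, and `clock`/`sleep` are injectable only to make it testable.

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self) -> float:
        """Sleep if needed; return how many seconds were spent sleeping."""
        now = self.clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self.sleep(delay)
            now = self.clock()
        self._next_allowed = now + self.min_interval
        return delay


# Usage sketch: call wait() before every request the bot makes.
# throttle = Throttle(min_interval=1.0)
# for url in urls:
#     throttle.wait()
#     fetch(url)  # hypothetical fetch function
```

Because the wait happens before each request, time spent processing a response counts towards the interval, so the bot never sleeps longer than it has to.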
It should also be pointed out that Wikipedia offer data dumps so that trawling the site isn't really necessary.