22

I saw this on CloudFlare's homepage:

> CloudFlare protects against a range of threats: cross site scripting, SQL injection, comment spam, excessive bot crawling, email harvesters, and more.

How could a company like CloudFlare block crawler bots and email harvesters? I assume the bots are smart enough not to use User-Agent: Evil-Email-Harvester. So how do you differentiate a bot like an email harvester from a normal user?

I guess you could see that it is some kind of bot because you get requests for multiple sites from the same IP. But that would also be the case for many legit IPs, like a VPN. How do you tell the good from the bad?

Anders
  • 26
    [RFC 3514](https://www.ietf.org/rfc/rfc3514.txt) compliant bots will have the evil bit set, so they can filter on that. – tarleb Mar 03 '16 at 12:39

3 Answers

23

CloudFlare sits as a guard between your web server and the client. All content the client receives is provided by your web server and filtered by CloudFlare on the way out. This lets CloudFlare obfuscate email addresses by matching them with a regex and rewriting them before the page is delivered to the client.

If your website contains the email

<a href="mailto:s@scha.bz">s@scha.bz</a>

CloudFlare will replace it with

<a href="/cdn-cgi/l/email-protection#fed8ddcfcfcbc5d8ddc8cac5d8ddcfcfcbc5d8ddc7c7c5d8ddcfcecac5d8ddc7c9c5d8ddcac8c5d8ddc7c6c5d8ddcfccccc5">&#115;&#64;&#115;&#99;&#104;&#97;&#46;&#98;&#122;</a>

The /cdn-cgi/ folder, though it still points to your web server, is handled only by CloudFlare, which automatically intercepts everything submitted to it, deobfuscating and returning the correct email address.
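To make the rewriting step concrete, here is a minimal sketch in Python. It is not CloudFlare's implementation: the regex is deliberately simplistic, and the hex token produced by `to_token` is a made-up stand-in for whatever encoding CloudFlare really uses in the URL fragment. Only the decimal HTML entity encoding of the visible text matches the example above.

```python
import re

# Deliberately simple pattern for mailto links. This is an assumption for
# illustration, not the expression CloudFlare uses (which is not public).
MAILTO_LINK = re.compile(r'<a href="mailto:([^"@\s]+@[^"@\s]+)">([^<]*)</a>')


def to_entities(text):
    """Encode every character as a decimal HTML entity, e.g. 's' -> '&#115;'."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def to_token(address):
    """Hypothetical opaque token: the UTF-8 bytes of the address as hex."""
    return address.encode("utf-8").hex()


def from_token(token):
    """Reverse of to_token, as the proxy-side endpoint would do it."""
    return bytes.fromhex(token).decode("utf-8")


def obfuscate(html_page):
    """Rewrite every mailto link so the plain address no longer appears."""
    def replace(match):
        address, label = match.group(1), match.group(2)
        return (
            f'<a href="/cdn-cgi/l/email-protection#{to_token(address)}">'
            f"{to_entities(label)}</a>"
        )
    return MAILTO_LINK.sub(replace, html_page)


page = '<p>Contact: <a href="mailto:s@scha.bz">s@scha.bz</a></p>'
print(obfuscate(page))
```

A browser renders the entity-encoded label as the normal address, so human visitors see no difference, while a crawler that only greps the raw HTML for `name@host` patterns finds nothing.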

Of course this is not bulletproof (that is simply not possible), as a bot can follow that URL or search for encoded email patterns. This is a rare occurrence, though, and most of today's simple crawlers won't find your email.

You shouldn't rely on this approach: CF is already quite popular, and it is easy to detect and deobfuscate those email addresses. Using your own unique obfuscation technique is more likely to be safe against intelligent harvesters, as adapting a crawler to every single obfuscation technique is too much work.
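To illustrate how little effort the "search for encoded email patterns" route takes, here is a rough sketch of what a slightly smarter harvester could do with entity-encoded text. The function name `harvest` and the simple address regex are my own assumptions; `html.unescape` is part of the Python standard library.

```python
import html
import re

# Simple address pattern; good enough for a demonstration, with the usual
# false positives and negatives of any short email regex.
SIMPLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(markup):
    """Decode HTML entities first, then look for plain email addresses."""
    decoded = html.unescape(markup)
    return SIMPLE_EMAIL.findall(decoded)


snippet = "<a>&#115;&#64;&#115;&#99;&#104;&#97;&#46;&#98;&#122;</a>"
print(harvest(snippet))  # ['s@scha.bz']
```

One extra decoding call is enough here, which is why a widely deployed, well-known scheme offers less protection than an obscure one.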

James Cameron
  • So they do not block email harvesters, they just obfuscate the emails for you? – Anders Mar 03 '16 at 09:45
  • 1
    CloudFlare can be quite aggressive against bots in the "I'm under attack" mode by verifying connections using JavaScript. By default, it will allow most bots if they don't appear to be part of a (DDoS) attack. The email protection mostly consists of this obfuscation, exactly. – James Cameron Mar 03 '16 at 09:51
  • @JamesCameron Using a regex to find email addresses [isn't exactly trivial](http://ex-parrot.com/~pdw/Mail-RFC822-Address.html), although in this case you might be able to get a good enough result with a simpler one if you're willing to accept some false positives/negatives. – Roujo Mar 03 '16 at 18:10
  • How do they avoid causing bugs when you already send an obfuscated email address, then de-obfuscate it on the client side using js? If they "obfuscate" the email you already obfuscated, then they would mess up your client side de-obfuscation, and provide users with the wrong email. – hhamilton Mar 03 '16 at 23:58
16

Simple bot behaviour and "normal user" behaviour are noticeably different, and most bots tend to be relatively simple, since that works for the majority of sites. For example, consider arriving on Security.SE:

  • A human loads the page; there is a delay of a few seconds or more whilst they read the first few questions, then you get a request for a page, followed by browser-initiated requests for supporting files (images, scripts, styles). You would then expect a bit of time to pass before a request with that page as the referrer comes in for another page. A more technical user might open several questions if they're using a tabbed browser, but there will be a short pause between these requests (whilst they move the mouse or tab to the next question), then, again, you expect a pause before any manual requests from these pages.
  • A bot loads the page and immediately parses it, looking for links/email addresses. You see a large number of requests almost immediately after the page has been sent. Depending on the bot, you may find that supporting files aren't loaded (the bot doesn't care about your style). The bot is likely to do the same with links from the pages it then receives, and keep doing it until it can't find any more links.

These methods can be bypassed with a bit of effort to make a bot look like a human, but that would slow down the crawling process a lot, so dubious bot owners don't seem to bother doing that.
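As a rough illustration of the two patterns described above, the timing and asset signals could be turned into a score like this. Everything here is hypothetical: the `Request` record, the thresholds, and the scoring are assumptions for the sketch, not anything CloudFlare has published.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the client's first request
    path: str
    is_asset: bool    # image, script, or stylesheet


# Hypothetical thresholds; a real system would tune these on real traffic.
MIN_HUMAN_GAP_SECONDS = 1.0  # humans rarely open pages faster than this
BURST_SIZE = 5               # this many fast page loads counts as a burst


def suspicion_score(requests):
    """Return 0 to 2: one point per bot-like signal seen for a client."""
    pages = sorted(
        (r for r in requests if not r.is_asset), key=lambda r: r.timestamp
    )
    gaps = [b.timestamp - a.timestamp for a, b in zip(pages, pages[1:])]
    fast_gaps = sum(1 for gap in gaps if gap < MIN_HUMAN_GAP_SECONDS)

    score = 0
    if fast_gaps >= BURST_SIZE:
        score += 1  # many page requests almost immediately after each other
    if pages and not any(r.is_asset for r in requests):
        score += 1  # pages fetched, but no images/scripts/styles
    return score


human = [
    Request(0.0, "/", False),
    Request(0.2, "/style.css", True),
    Request(0.3, "/logo.png", True),
    Request(12.0, "/questions/1", False),
    Request(14.5, "/questions/2", False),
]
bot = [Request(0.1 * i, f"/questions/{i}", False) for i in range(7)]

print(suspicion_score(human), suspicion_score(bot))
```

A bot that sleeps between requests and fetches stylesheets would score 0 here, which is exactly the bypass mentioned above, at the cost of crawling far more slowly.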

Matthew
  • 1
    This answer is mostly about bots in general but doesn't consider the fact that the OP specifically mentions CloudFlare. I feel this doesn't address the question. If the OP, as it appears, wasn't focusing on CF but was asking in general terms, then this is of course a valid answer. – James Cameron Mar 03 '16 at 09:54
  • 3
    One thing people with a broader vision of the big picture often overlook is that from the *particular case* to the *big picture* there is usually some sort of connector that is left implicit but may not be obvious to the layman. In this case, it is something like this line: **CloudFlare has heuristics that help them detect bots and then deny their requests;** "*Simple bot behaviour and "normal user" behaviour*"(..). – Mindwin Mar 03 '16 at 14:56
  • 3
    My impression from the question was that the asker wants to know how CloudFlare is able to detect a bot, not necessarily what they do to the content once they have detected one. I think this answer does a better job of addressing that. – Sam Mar 03 '16 at 19:57
12

In addition to James' and Matthew's answers (which both make valid points, by the way):

Obviously services like CloudFlare have a bunch of detection methods to decide whether or not a client is allowed through their various layers of protection.

They have a lot of information on their website about these features but you probably won't find specific rules and implementation details as this would make detection easier to circumvent.

> I guess you could see that it is some kind of bot because you get requests for multiple sites from the same IP. But that would also be the case for many legit IPs, like a VPN. How do you tell the good from the bad?

Anecdotal: I'm indeed often deemed 'suspicious' by CloudFlare whilst connected to a VPN.

I suspect a lot of the factors Matthew mentioned (load time, type of resources requested, pauses before next requests) contribute to CloudFlare not instantly blocking me.
Instead they serve up Google's reCAPTCHA to confirm I am not a bot/crawler and let me through afterwards.
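My guess at the general shape of this "challenge instead of block" behaviour is a tiered decision like the sketch below. To be clear, this is speculation: the `Action` names, the 0 to 1 suspicion scale, and both thresholds are invented for illustration, and CloudFlare does not publish its real rules.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"  # e.g. serve a CAPTCHA before letting the client in
    BLOCK = "block"


def decide(suspicion, challenge_at=0.4, block_at=0.9):
    """Map a suspicion value in [0, 1] to a response tier.

    Both thresholds are hypothetical defaults chosen for this sketch.
    """
    if suspicion >= block_at:
        return Action.BLOCK
    if suspicion >= challenge_at:
        return Action.CHALLENGE
    return Action.ALLOW


# A VPN user with otherwise human-looking behaviour might land in the middle.
print(decide(0.1), decide(0.5), decide(0.95))
```

The middle tier is what keeps shared IPs such as VPN exits usable: the address alone raises suspicion, but a passed challenge lets the client through.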

More info:
James' answer: E-mail obfuscation
Matthew's answer: Web Application Firewall/WAF

Fluffy