Scrape Artist here.
You can't stop me, no matter who you are
There is no reliable way to do it. You can only make it harder, and the harder you make it, the harder you make it for legitimate users too. I write web scrapers for fun, and not only have I bypassed every one of the ideas Tim lists above, I can do it quickly.
From my perspective: if your information is valuable enough, I will find a way around anything you introduce to stop me, and I will do it faster than you can fix it. Fighting me is a waste of your precious time.
- Preventing enumeration: doesn't stop a mass dump of your pages. Once I've got your data, I can parse it on my end.
- Throttling requests: bans legitimate users who are just browsing around. It doesn't matter: I will test, through a VPN, how many requests you allow in a given period. If you ban me after 4 attempts, I will update my scraper to rotate through my distributed proxy list, pulling 4 items per proxy. This is trivial, since every HTTP connection lets me specify a proxy. A simple example:
// Rotate through my proxy list: each proxy fetches 4 pages, then the
// next one takes over, so no single IP ever trips your limit.
for (int i = 0; i < numAttemptsRequired; i++)
{
    using (WebClient wc = new WebClient())
    {
        if (i % 4 == 0 && i > 0)
        {
            curProxy++; // move to the next proxy every 4 requests
        }
        wc.Proxy = proxyList[curProxy];

        // Fetch the page exactly like any other client would,
        // then hand the HTML off to my parser.
        string html = wc.DownloadString(urls[i]);
    }
}
I can also add a simple delay so my requests never come faster than a few per second, the same pace as a regular user.
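A minimal sketch of that pacing, meant to be called once per request inside the loop above (the 1-3 second range is just an illustrative guess at human browsing speed):

using System;
using System.Threading;

static class Pace
{
    private static readonly Random Jitter = new Random();

    // Sleep for a slightly random, human-looking interval so the
    // traffic pattern doesn't stand out as a bot.
    public static void LikeAHuman()
    {
        Thread.Sleep(Jitter.Next(1000, 3000)); // roughly 1-3 seconds
    }
}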
"Honeypot": bans legitimate users who are looking around and will likely interfere with the user experience.
- Obscuring data:
  - Put part of the data in an image: this hurts visually impaired users by making your website inaccessible, and it's all for naught anyway, because I'll still download your images. There are plenty of programs that read text from images, and unless you make the text horribly unclear and warped (which, again, hurts the user experience), I'll pull the information out of them as well (see the OCR sketch after this list).
  - Change your HTML often (so an attacker has to change their parser as well): good luck. If you just rename classes and IDs, you'll almost certainly be following a pattern, and if there's a pattern, I will make the scraper name-agnostic (there's a sketch of what that looks like after this list). It becomes a never-ending arms race, and each of your changes takes me only a few minutes to work around. Meanwhile, you've burned hours making sure everything still works, updating your CSS and probably your JavaScript along the way. At that point you're continually breaking your own website.
  - Mask/encrypt your data and use JavaScript to unmask/decrypt it (changing the method from time to time, so an attacker has to change theirs as well): one of the worst ideas I've ever heard. It would introduce so many potential bugs to your website that you'd spend a large amount of time fighting them. Parsing this is so mind-numbingly easy that it would take me a few seconds to update the scraper (see the decoding sketch after this list), while it probably took you 324 hours to get it working right. Oh, and some of your users won't have JavaScript enabled thanks to NoScript; they'll see garbage and leave the site before ever allowing it.
- Limiting access: my scrapers can simply log in to your website once I create an account (see the login sketch after this list).
- The Law: the law only helps if the scraper sits in a country that shares (and enforces) the same laws as yours. Nobody in China is going to care if you scrape a U.S. website and republish everything; nobody in most countries will care.
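To make the image point above concrete: a minimal OCR sketch using the Tesseract .NET wrapper, one tool among many (the file paths and the local tessdata folder are assumptions for the example):

using System;
using Tesseract;

// Assumption: the image has already been downloaded and a "tessdata"
// folder with the English language data sits next to the scraper.
using (var engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default))
using (var img = Pix.LoadFromFile("./downloaded-price-label.png"))
using (var page = engine.Process(img))
{
    // Plain text, pulled straight out of the "obscured" image.
    Console.WriteLine(page.GetText());
}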
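And here is what name-agnostic means in practice: instead of keying off your class names, I match the shape of the data itself. A minimal sketch (the sample markup and the price regex are purely illustrative):

using System;
using System.Text.RegularExpressions;

// Whatever the tags and CSS classes are called this week, anything
// that looks like a price still looks like a price.
string html = "<span class=\"x9f3\">$19.99</span><div data-q=\"z\">$4.50</div>";

foreach (Match m in Regex.Matches(html, @"\$\d+(\.\d{2})?"))
{
    Console.WriteLine(m.Value); // $19.99, then $4.50
}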
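The same goes for client-side unmasking: whatever your JavaScript does to decode the data, my scraper does too. A minimal sketch, assuming purely for illustration that the page ships prices Base64-encoded and decodes them in the browser:

using System;
using System.Text;

// The browser runs atob("JDE5Ljk5") to display the price;
// I run the equivalent decode in one line and move on.
string masked = "JDE5Ljk5"; // lifted from the scraped HTML
string price = Encoding.UTF8.GetString(Convert.FromBase64String(masked));
Console.WriteLine(price); // $19.99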
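And logging in is just two requests. A minimal sketch using HttpClient with a cookie container (the URLs and form field names are hypothetical):

using System.Collections.Generic;
using System.Net;
using System.Net.Http;

// Post the login form once, keep the session cookie, then request
// the members-only pages like any signed-in user would.
var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
using (var client = new HttpClient(handler))
{
    var form = new FormUrlEncodedContent(new Dictionary<string, string>
    {
        ["username"] = "my-throwaway-account",   // hypothetical field names
        ["password"] = "not-my-real-password"
    });
    await client.PostAsync("https://www.example-shop.com/login", form);

    // The handler now carries the session cookie automatically.
    string html = await client.GetStringAsync("https://www.example-shop.com/members/data");
}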
And through all of this, I can impersonate a legitimate user by sending fake User-Agent strings and other headers based on real values.
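Faking that fingerprint takes one header. A sketch with WebClient (the User-Agent string is just an example of the kind of value a mainstream desktop browser sends):

using System.Net;

using (var wc = new WebClient())
{
    // Present the scraper as an ordinary desktop browser.
    wc.Headers[HttpRequestHeader.UserAgent] =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";
    string html = wc.DownloadString("https://www.example-shop.com/");
}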
Is there anything I can do?
Not really. Like I said, you can only make it harder to access. In the end, the only people getting hurt will be your legitimate users. Ask yourself: "How much time would I spend on this fix, and how will it affect my customers? What happens if someone finds a way around it quickly?"
If you try to make your website too difficult to access, you may even end up introducing security holes, which would make it even easier for malicious visitors to exploit your website and dump all of your information without needing to scrape it. Even worse, your customers could be affected negatively.
If you limit attempts and require authentication, that may slow down how fast I can aggregate your main site, but I can still find everything through search results and linked pages.
In the end, it doesn't matter. I'll still get around that with my proxy list. I'll still get all of your data much faster than a normal user could.
Winning through a better user experience
I believe that if you present a quality browsing experience and a good product, people will come to you even if others have the same data. They know your website works, they aren't frustrated by bugs, and they aren't plagued by usability problems.
Here's an example: on Amazon.com, I can aggregate almost all of their products very quickly just by changing a few numbers in the URL (a sketch of that kind of enumeration is below). But where does that data get me? Even if I have the products, people will still visit Amazon.com, not my knock-off website.
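A minimal sketch of that kind of ID enumeration, against a purely hypothetical catalog (the URL pattern and ID range are invented for illustration):

using System.IO;
using System.Net;

// Walk a sequential product ID range and save every page for parsing later.
using (var wc = new WebClient())
{
    for (int id = 1; id <= 100000; id++)
    {
        string html = wc.DownloadString("https://www.example-shop.com/product/" + id);
        File.WriteAllText("product_" + id + ".html", html);
    }
}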