
How can I prevent client software from scraping data from my website?

Some examples of my URLs are:

It's very easy to run a loop and hit the server with a changed query string on each request. What security measures can I take? The information on my site is published weekly (a huge data-entry cost), and I don't want someone to run a script and get all of it in a few minutes.
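For illustration, a loop like the following (in C#; the URL pattern is only a placeholder, since my real URLs aren't shown above) would pull every record:

    // Illustration only: walks sequential ids in the query string and
    // saves each page. "example.com/items?id=" is a made-up pattern
    // standing in for the real URLs.
    for (int id = 1; id <= 50000; id++)
    {
        using (var wc = new System.Net.WebClient())
        {
            string html = wc.DownloadString("https://example.com/items?id=" + id);
            // ... store html ...
        }
    }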

brainHax

3 Answers


You can't really prevent it if the data is publicly available.

But that doesn't mean that you have to make it extra easy to scrape the data either.

Preventing Enumeration

By exposing internal, ordered ids, you make it extra easy to scrape all products.

If you changed the URL to use either the product name or a random id, an attacker couldn't retrieve all the data with a simple loop.
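As a minimal sketch (not part of the original answer; the class and field names are illustrative): generate an opaque public id when the product is created and expose only that in the URL:

    // Sketch: each product gets a random, non-guessable public id;
    // the sequential database id never appears in a URL.
    public class Product
    {
        public int InternalId { get; set; }   // server-side only
        public string PublicId { get; } = System.Guid.NewGuid().ToString("N");
    }

    // The page URL becomes e.g. /product/3f2c9a0d4b1e4c8e9f7a6b5c4d3e2f1a
    // instead of /product?id=1024, so a counting loop finds nothing.

The random id only needs to be unique, not cryptographically secret; the point is simply that it can't be enumerated with a counter.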

Throttling Requests

You could limit the number of requests a user can make. This isn't all that easy though, because you can't really limit by IP address (you would also restrict legitimate users sharing the same IP address, and attackers can just change their IP address). Here is a question about identifying users with some alternative ideas.
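A very rough sketch of the counting part (the RequestThrottle class and its names are made up for illustration; the client key could be an IP address, a session id, or one of the alternatives from the linked question):

    using System;
    using System.Collections.Generic;

    // Sketch: allow at most `limit` requests per client per one-minute window.
    public class RequestThrottle
    {
        private readonly Dictionary<string, (DateTime WindowStart, int Count)> _counts
            = new Dictionary<string, (DateTime WindowStart, int Count)>();
        private readonly object _lock = new object();
        private readonly int _limit;

        public RequestThrottle(int limit) { _limit = limit; }

        public bool IsAllowed(string clientKey)
        {
            lock (_lock)
            {
                var now = DateTime.UtcNow;
                if (!_counts.TryGetValue(clientKey, out var entry)
                    || now - entry.WindowStart > TimeSpan.FromMinutes(1))
                {
                    entry = (now, 0);   // start a fresh window
                }
                entry = (entry.WindowStart, entry.Count + 1);
                _counts[clientKey] = entry;
                return entry.Count <= _limit;
            }
        }
    }

You would call IsAllowed(key) at the top of each product-page request and return an error (for example HTTP 429) when it says no.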

"Honeypot"

You could create fake products, which you never link to (or only in links hidden via CSS).

When someone views such a product, ban them.

Alternatively, you could add quite a lot of these products, and NOT ban a scraper, but just let them keep the wrong data, making their data less accurate (this may or may not make sense in your case).
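A sketch of the banning variant (server-side; the HoneypotFilter class, the path, and the method names are all invented for illustration):

    using System.Collections.Generic;

    // Sketch: the catalogue pages contain a link no human should follow,
    // e.g. <a href="/product/honeypot-9f3c" style="display:none">...</a>.
    // Any client that requests it is treated as a crawler and banned.
    public class HoneypotFilter
    {
        private readonly HashSet<string> _banned = new HashSet<string>();

        // Returns false if the request should be rejected.
        public bool Allow(string path, string clientKey)
        {
            if (_banned.Contains(clientKey))
                return false;                        // already banned

            if (path.StartsWith("/product/honeypot-"))
            {
                _banned.Add(clientKey);              // walked into the trap
                return false;
            }
            return true;                             // serve normally
        }
    }

The same check can be relaxed so it merely tags the client and serves the poisoned fake products, per the alternative above.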

Obscure Data

You could try to make it harder for a scraper to use your data. This may impact your users and may be quite a bit of work on your part (and as with all approaches, a determined attacker can still get the data); a rough sketch of the last idea follows the list:

  • Put part of the data in an image.
  • Change your HTML often (so an attacker has to change their HTML parser as well).
  • Mask/encrypt your data and use JavaScript to unmask/decrypt (change the method from time to time, so an attacker would need to change theirs as well).
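Here is that sketch, assuming a deliberately weak Base64 mask (anything stronger works the same way; the DataMask helper, the attribute name, and the decode snippet are illustrative):

    using System;
    using System.Text;

    // Sketch: the server ships the price Base64-encoded in a data attribute;
    // a few lines of JavaScript on the page decode it before display, e.g.
    //   <span class="price" data-v="JDQ5Ljk5"></span>
    //   document.querySelectorAll(".price").forEach(
    //       s => s.textContent = atob(s.dataset.v));
    public static class DataMask
    {
        public static string MaskForPage(string value)
        {
            return Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
        }
    }

Rotating the mask means regenerating both this method and the matching decode script, which is why the next answer considers it a poor trade.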

Limit Access

You could put the content behind a login, and ban users that scrape data (probably not a good idea in your case, as you do want users that don't have an account to see products).

The Law

Everyone is free to scrape the data on your website (probably; it may depend on your jurisdiction). But re-publishing is likely in violation of the law, so you could sue them.

tim
    "Put part of the data in an image." is +1 here, but big - on UX. And in some cases might even be illegal - like if you took some public money to make this site, like a grant for new business, and now are obliged to make it as accessible for people with disabilities - this one is against visually impaired people. Just noting, because this particular point shouldn't be followed unless someone really knows what it means, and it means more than it looks. – Mołot Jan 24 '16 at 13:26
  • @MarkBuffalo well, like I said, scraping can't be prevented. But the cost can be increased for a scraper, and if the cost is increased enough, it may not be worth it anymore (this obviously doesn't work for people who write scrapers for fun, as it just makes it more interesting :) ). Some of the ideas (such as the first one) are very easily implemented, so I would say they are worth it. Others may not be, depending on the concrete situation. – tim Jan 24 '16 at 16:20
  • I disagree with pretty much every single idea here. I write scrapers for fun, and can bypass every single thing you've listed here. These are options people try and take to stop my scrapers. And in the end, it just doesn't work. I will find a way around everything. – Mark Buffalo Jan 24 '16 at 16:22
  • @tim Yeah, it just doesn't work. This is a good post, because it really shows what people try to do to stop it. My post will show part of how I easily circumvent everything people throw at me. – Mark Buffalo Jan 24 '16 at 16:24

Scrape Artist here.


You can't stop me no matter who you are

There is no reliable way to do it. You can only make it harder. The harder you make it, the harder you make it for legitimate users. I write web scrapers for fun, and not only have I bypassed all of the above ideas by tim, but I can do it quickly.

From my perspective, if your information is valuable enough, I will find a way around anything you introduce to stop me. And I will do all of this faster than you can fix it. So it's a waste of your precious time.

  • Preventing enumeration: doesn't stop a mass dump of your pages. Once I've got your data, I can parse it on my end.
  • Throttling requests: bans legitimate users who look around. Doesn't matter: I will test how many requests are allowed in a certain period of time using a VPN. If you ban me after 4 attempts, I will update my scraper to use my distributed proxy list and make 4 requests per proxy. This is super easy, since every single connection can be given its own proxy. Simple example:

    for (int i = 0; i < numAttemptsRequired; i++)
    {
        using (WebClient wc = new WebClient())
        {
            // Move to the next proxy after every fourth request so no
            // single exit IP trips the four-attempt limit, wrapping
            // around when the list runs out.
            if (i > 0 && i % 4 == 0)
            {
                curProxy = (curProxy + 1) % proxyList.Count;
            }
            wc.Proxy = proxyList[curProxy];

            // pageUrls would be the list of product pages being dumped.
            string html = wc.DownloadString(pageUrls[i]);
        }
    }

    I can also add a simple delay so requests don't happen more than a few times per second, at the same speed as a regular user (see the pacing sketch after this list).

  • "Honeypot": bans legitimate users who are looking around and will likely interfere with the user experience.

  • Obscure Data:
    • Put part of the data in an image: hurts visually impaired users by making your website inaccessible. I'll still download your images, so it'll all be for naught. There are also a lot of programs to read text from images. Unless you're making it horribly unclear and warped (which, again, affects user experience), I'll get the information from there as well.
    • Change your HTML often (so an attacker has to change their HTML parser as well): Good luck. If you change names and such, you'll likely be introducing a pattern, and if you introduce a pattern, I will make the scraper name-agnostic. It will be a never-ending arms race, and it would only take me a few minutes to update my end if you changed the pattern. Meanwhile, you've spent tons of time making sure everything still works, and then you have to update your CSS and probably your JavaScript too. It looks like you are continually breaking your website at this point.
    • Mask/encrypt your data and use JavaScript to unmask/decrypt (change the method from time to time, so an attacker would need to change theirs as well): One of the worst ideas I've ever heard. This would introduce so many potential bugs to your website that you'd spend a large amount of time fighting them. Parsing this is so mind-numbingly easy that it would take me a few seconds to update the scraper, while it probably took you 324 hours to get it working right. Oh, and some of your users might not have JavaScript enabled thanks to NoScript. They'll see garbage and leave the site before allowing it.
  • Limit Access: My scrapers can log in to your website if I create an account.
  • The Law: the law only helps where your country and the scraper's country share (and enforce) the relevant laws. Nobody in China is going to care if you scrape a U.S. website and republish everything. Nobody in most countries will care.
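The pacing mentioned in the throttling bullet above is equally trivial; a sketch (pageUrls, proxyList, and curProxy are the same hypothetical names as in the example above, and the delay range is arbitrary):

    // Sketch: fetch pages at roughly human speed so per-second limits
    // never trigger.
    var rng = new System.Random();
    foreach (var url in pageUrls)
    {
        using (var wc = new System.Net.WebClient())
        {
            wc.Proxy = proxyList[curProxy];
            string html = wc.DownloadString(url);
        }
        System.Threading.Thread.Sleep(rng.Next(2000, 6000));  // 2-6 s between pages
    }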

And in all of this, I can impersonate a legitimate user by sending fake user-agents, etc., based on real values.
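For instance, still with WebClient, making a request look like an ordinary browser is a couple of header assignments (the target URL and header values here are just examples):

    using (var wc = new System.Net.WebClient())
    {
        // Copy the User-Agent (and other headers) from a real browser session.
        wc.Headers[System.Net.HttpRequestHeader.UserAgent] =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
            "(KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36";
        wc.Headers[System.Net.HttpRequestHeader.AcceptLanguage] = "en-US,en;q=0.9";
        string html = wc.DownloadString("https://example.com/items?id=42");
    }

Combined with the proxy rotation above, each request looks like a different, ordinary visitor.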


Is there anything I can do?

Not really. Like I said, you can only make it harder to access. In the end, the only people getting hurt will be your legitimate users. Ask yourself this: "How much time would I spend on this fix, and how will it affect my customers? What happens if someone finds a way around it quickly?"

If you try to make your website too difficult to access, you may even end up introducing security holes, which would make it even easier for malicious visitors to exploit your website and dump all of your information without needing to scrape it. Even worse, your customers could be affected negatively.

If you limit attempts and require authentication, that can slow down aggregation of your main site, but I can still find everything through search results and linked pages.

In the end, it doesn't matter. I'll still get around that with my proxy list. I'll still get all of your data much faster than a normal user could.


Winning through a better user experience

I believe that if you present a quality browsing experience to your users, and you have a good product, people will come to you even if others have the data as well. They know your website works, and they aren't frustrated with bugs, nor are they plagued with usability problems.

Here's an example: on amazon.com, I can aggregate almost all of their products very quickly by changing a few numbers. If I take that data, where does it get me? Even if I have the products, people will still be visiting Amazon.com, and not my knock-off website.

Mark Buffalo
  • What about `Honeypot` = creating fake products, fake (hidden) data, etc.? For example, one of my websites is basically a database of technical data that can be displayed as a list, and each item can be viewed in detail on another page. It would be easy to add some incorrect (hidden) data in the middle of the details, making the scraped data essentially useless. Of course, you could always try to detect which hidden fields are fake and which are just hidden for UX reasons. It seems like it would make things harder for less competent scrapers, especially if I introduce some randomness? – Magnus Apr 02 '17 at 06:31
  • Yeah, you could do that if you wanted to. I'd be interested to see the results. – Mark Buffalo Apr 05 '17 at 20:14
  • Why not disallow headless browsers, or Selenium or PhantomJS? Anyone heard of a .zip bomb from a decade ago? Same principle: serve an iframe only bots can see and let them parse into infinity. One more: write your own bot to observe the scraper's behavior and record it; based on that, you might find a weakness. – shamelessApathy Sep 04 '19 at 01:40

As well as the excellent points in Tim's answer, there are a couple more options.

Complain to their ISP

If scraping your site is a violation of your terms and conditions, you can complain to the ISPs of scrapers you have identified from your logs, and they will generally tell their customers to stop doing it.

Live with it

Try to quantify the damage the scraping is doing to you. Compare that with the effort required to stop it. Is it really worth worrying about, or is it just an annoyance that is best ignored?

Mike Scott