8

We have a searchable database (DB). We limit the results to 15 per page and a maximum of 100 results per query, yet we still get people trying to scrape the site.

We are already banning clients that hit it fast enough. I was wondering if there is anything else we can do. Maybe render the results in Flash?

zero8
Randin
  • Make sure you have a robots.txt ... yeah I know not everyone honors it .. but some still do – trent May 12 '09 at 02:21

7 Answers

13

Since there is obviously a demand for your database, have you thought about turning it around and providing what the scrapers want? Form a business connection with the scrapers and encourage appropriate use with an API?
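For illustration, a minimal sketch of a key-gated search endpoint (Flask here; the key names, quota, and in-memory storage are made up for the example):

```python
# Hypothetical sketch: gate the search endpoint behind per-partner API keys
# with a simple daily quota. Keys, limits, and storage are illustrative only;
# a real site would keep them in the database.
from collections import defaultdict
from datetime import date
from functools import wraps

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

API_KEYS = {"partner-abc123": {"daily_limit": 5000}}
usage = defaultdict(int)  # (key, date) -> request count


def require_api_key(view):
    @wraps(view)
    def wrapper(*args, **kwargs):
        key = request.headers.get("X-Api-Key")
        plan = API_KEYS.get(key)
        if plan is None:
            abort(401)  # unknown or missing key
        bucket = (key, date.today())
        if usage[bucket] >= plan["daily_limit"]:
            abort(429)  # quota exhausted for today
        usage[bucket] += 1
        return view(*args, **kwargs)
    return wrapper


@app.route("/api/search")
@require_api_key
def search():
    # Placeholder for the real DB query.
    return jsonify(results=[], query=request.args.get("q", ""))
```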

John McC
7

There is some good info in How do you stop scripters from slamming your website hundreds of times a second?

cletus
6

You could make it a bit more difficult by retrieving the records via AJAX, and using an authentication ID (like an API key) for the AJAX calls.

Of course, a scraper can get around this by reading the ID from the page and then making the AJAX request with it.

Rendering with Flash is an alternative as you point out (though still not 100% unscrapable), as is rendering in PDF.
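A minimal sketch of the AJAX-plus-token idea (Flask; the token format, lifetime, and endpoint names are assumptions for the example, and as noted above a determined scraper can still read the token from the page and replay it):

```python
# Sketch only: issue a short-lived HMAC-signed token when the page is
# rendered, and require it on the AJAX endpoint that returns result rows.
import hashlib
import hmac
import time

from flask import Flask, abort, jsonify, render_template_string, request

app = Flask(__name__)
SECRET = b"rotate-me-regularly"
PAGE = '<div id="results" data-token="{{ token }}"></div>'


def make_token(now=None):
    ts = str(int(now if now is not None else time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"


def token_is_valid(token, max_age=300):
    try:
        ts, sig = token.split(":")
    except (AttributeError, ValueError):
        return False
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and time.time() - int(ts) < max_age


@app.route("/search")
def search_page():
    # The page embeds the token; client-side JS sends it with each AJAX call.
    return render_template_string(PAGE, token=make_token())


@app.route("/ajax/results")
def ajax_results():
    if not token_is_valid(request.headers.get("X-Search-Token")):
        abort(403)
    return jsonify(results=[])
```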

womble
Ivan
4

There is no technological solution to prevent a motivated individual from scraping your publicly accessible content.

You can, however, legally protect your intellectual property by:

  • Ensuring that your site has a clearly marked copyright
  • Posting a Terms of Service in the footer that clearly prohibits scraping
  • Embedding a digital watermark in all of your site's content. Don't forget that text can be watermarked as well (a sketch of one approach follows this list).
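One possible way to watermark plain text, purely illustrative (the use of zero-width characters and the 16-bit ID are assumptions, not anything from the answer), is to embed a per-request identifier so a leaked copy can be traced back to the account or IP that fetched it:

```python
# Illustrative only: hide a numeric ID in served text using zero-width
# characters. Trivial to strip if the scraper knows to look for it.
ZW0, ZW1 = "\u200b", "\u200c"  # zero-width space / zero-width non-joiner


def watermark(text, mark_id, width=16):
    bits = format(mark_id, f"0{width}b")
    mark = "".join(ZW1 if b == "1" else ZW0 for b in bits)
    # Hide the mark after the first word; real use would spread it out.
    head, sep, tail = text.partition(" ")
    return head + mark + sep + tail


def extract(text, width=16):
    bits = "".join("1" if c == ZW1 else "0" for c in text if c in (ZW0, ZW1))
    return int(bits[:width], 2) if len(bits) >= width else None


marked = watermark("Result row: Acme Widget, $19.99", mark_id=4242)
assert extract(marked) == 4242
```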
Portman
2

How about setting up authentication (and perhaps a CAPTCHA), tracking usage, and limiting access to some number of records or searches in a given time period?
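A rough sketch of the per-account quota idea (the window, limit, and in-memory storage are made up for the example; a real site would use Redis or the database):

```python
# Cap how many searches an authenticated account can run in a rolling window.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # one hour
MAX_SEARCHES = 50       # per account per window

_search_log = defaultdict(deque)  # user_id -> timestamps of recent searches


def allow_search(user_id, now=None):
    now = time.time() if now is None else now
    log = _search_log[user_id]
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()               # drop searches that fell out of the window
    if len(log) >= MAX_SEARCHES:
        return False                # over quota: show a CAPTCHA or an error
    log.append(now)
    return True
```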

tomjedrz
1

You will probably find that the scrapers will improve their game as you apply different techniques. Perhaps there is a way to analyze the behaviour of users who scrape and present a CAPTCHA or other disruption? Perhaps you could limit the results to a smaller number for a period of time, forcing a scraper to wait, say, 10 days to collect everything; if they don't log on in between, assume they are scrapers.

Whatever you do, make sure to mix up your techniques to give them a little more longevity.
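One way to "analyze the behaviour" might be to flag sessions that walk the result pages strictly in order at machine-like speed and route them to a CAPTCHA. The thresholds and the notion of a session here are assumptions for the sketch:

```python
# Flag sessions that fetch result pages sequentially, very quickly.
import time
from collections import defaultdict

sessions = defaultdict(lambda: {"last_page": 0, "last_time": 0.0, "hits": 0})


def looks_like_scraper(session_id, page, now=None):
    now = time.time() if now is None else now
    s = sessions[session_id]
    sequential = page == s["last_page"] + 1
    fast = 0 < now - s["last_time"] < 2.0   # pages fetched under 2s apart
    s["hits"] = s["hits"] + 1 if (sequential and fast) else 0
    s["last_page"], s["last_time"] = page, now
    return s["hits"] >= 5                   # five rapid sequential pages in a row
```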

Brian Lyttle
1

You need to consider that the scrapers may not be using your web pages and forms; they may just be hitting your site at the HTTP level.

I think that the best solution would be to present a CAPTCHA once an IP address exceeds a certain request threshold.

You need to be VERY careful, though, to ensure that you do not affect the scalability of your application for real users.

Limiting the amount of data per page as you describe in the question will only increase the number of requests that the clients will make against your server.
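A sketch of the threshold idea (the window and threshold are made up for the example, and old buckets would need evicting in real use), kept deliberately cheap so legitimate users are unaffected:

```python
# Count requests per IP in a fixed window; switch that IP to a CAPTCHA
# challenge once it crosses the threshold.
import time
from collections import defaultdict

WINDOW = 60          # seconds
THRESHOLD = 120      # requests per window before the CAPTCHA kicks in

_counters = defaultdict(int)  # (ip, window index) -> request count


def needs_captcha(ip, now=None):
    now = time.time() if now is None else now
    bucket = (ip, int(now // WINDOW))
    _counters[bucket] += 1
    return _counters[bucket] > THRESHOLD
```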

Bruce McLeod