
I am reading about the attack and defense strategies of web spiders. Assume I have sensitive information on my website that should be protected from third-party web spiders.

Use case #1:

  • Me: I make the sensitive data visible only to registered user accounts. Guest (tourist) accounts cannot see it, and hence cannot crawl it.

  • Attacker: Registers an account and uses the session cookie to crawl automatically (a minimal sketch of this follows the use case).
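
To make the attack concrete, here is a minimal sketch of what such a crawler could look like, assuming a hypothetical site at example.com whose login endpoint sets a session cookie. All names, URLs, and form fields are made up for illustration, not taken from any real service.

```python
# Hypothetical sketch of the use case #1 attack: log in once, then reuse
# the session cookie to fetch protected pages automatically.
import requests

session = requests.Session()

# Log in with a legitimately registered account; the Session object stores
# the cookie the server returns and replays it on every later request.
session.post(
    "https://example.com/login",                      # assumed login endpoint
    data={"username": "throwaway", "password": "hunter2"},
)

# From here on, every request looks like an ordinary logged-in user.
for page_id in range(1, 1001):
    resp = session.get(f"https://example.com/sensitive/{page_id}")
    if resp.ok:
        with open(f"page_{page_id}.html", "w", encoding="utf-8") as f:
            f.write(resp.text)
```

The point is that, once logged in, nothing distinguishes this loop from a real user except its volume and timing, which is what use case #2 tries to exploit.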

Use case #2:

  • Me: I detect the suspicious behavior of the account from use case #1 (e.g., page downloads above some threshold) and limit that account's privileges (a sketch follows this use case).

  • Attacker: Registers (or buys) multiple accounts and crawls from them in a distributed, automated way, so that each individual account looks less suspicious.
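
A rough sketch of the kind of per-account threshold check described above; the daily window and the threshold of 500 pages are assumptions made up for illustration.

```python
# Hypothetical per-account counter: flag an account whose page downloads
# exceed a daily threshold so its privileges can be limited.
from collections import Counter
from datetime import date

PAGE_THRESHOLD = 500          # assumed daily limit per account
_daily_counts = Counter()
_count_day = date.today()

def record_and_check(account_id: str) -> bool:
    """Record one page download; return True if the account looks suspicious."""
    global _count_day
    if date.today() != _count_day:    # reset the counters at the start of each day
        _daily_counts.clear()
        _count_day = date.today()
    _daily_counts[account_id] += 1
    return _daily_counts[account_id] > PAGE_THRESHOLD
```

The attacker's counter-move in this use case is exactly to keep each of many accounts below whatever threshold is chosen.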

Question:

  1. In both use cases, are the attacker's methods practical?

  2. What are the important tips to prevent these two attacks?

TJCLK
    If there is value to be had then you would be amazed at how far attackers will go. Here are some articles about different methods hackers have used to try to get money out of Uber, all of which require far more effort on the part of the hackers, and far more effort on the part of Uber to counteract. https://www.businessdailyafrica.com/corporate/companies/Uber--Taxify-drivers-use-fake-apps-to-defraud-riders/4003102-4843716-qov2mk/index.html https://www.vice.com/en_us/article/535zdn/scammers-say-they-got-uber-to-pay-them-with-fake-rides-and-drivers – Conor Mancone Aug 02 '19 at 12:09

1 Answer


Bad news: Both those attacks are absolutely realistic and practical.

Even more bad news: You can't really protect against this. You can make it harder to crawl your website, but you can't make it impossible. In the end, a motivated attacker will always win.

If you settle on just making crawling harder, here are some tips:

  • Rate limit, both per account and per IP (a minimal sketch follows this list). But the attacker can just use even more accounts and even more IPs, or simply let the crawl take more time.

  • Captchas! But to stop a human from solving one once and then letting the crawler take over, you'll need to serve them often, perhaps on every page load. That is not great for usability, and a lot of captchas can be bypassed by software or by paid humans.

  • Load sensitive content dynamically using JS. This will trick dumb static crawlers that just read the HTML as a long text string without executing any JS. There are plenty of crawlers that act like full browsers, though, and this will not help against those.

  • Make sensitive information hard to read, e.g. by putting it in images instead of text. This is a usability and accessibility nightmare, and there is nothing stopping a crawler from saving the images or even running OCR on them, so this isn't a real solution either.
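
As promised in the first tip, here is a minimal rate-limiting sketch keyed on both the logged-in account and the client IP. The Flask app, the session cookie name, and the limits are assumptions for illustration; in production the counters would live in something like Redis rather than process memory.

```python
# Hypothetical combined per-account and per-IP rate limiter using a
# rolling time window, wired into a Flask app as a before_request hook.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

# (max requests, window in seconds) -- made-up numbers for the sketch.
LIMITS = {"account": (200, 3600), "ip": (300, 3600)}
_hits = defaultdict(deque)

def _over_limit(key: str, max_requests: int, window: int) -> bool:
    now = time.time()
    hits = _hits[key]
    hits.append(now)
    while hits and hits[0] < now - window:   # drop hits outside the window
        hits.popleft()
    return len(hits) > max_requests

@app.before_request
def rate_limit():
    account = request.cookies.get("session_id", "anonymous")  # assumed cookie name
    checks = [
        ("account:" + account, *LIMITS["account"]),
        ("ip:" + (request.remote_addr or "unknown"), *LIMITS["ip"]),
    ]
    for key, max_requests, window in checks:
        if _over_limit(key, max_requests, window):
            abort(429)  # Too Many Requests
```

As the answer notes, this only raises the attacker's cost: with enough accounts and IPs, each individual key stays under the limit.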

If you want to make your content safe from crawlers, the only solution is to not give access at all to people you do not trust to keep their hands off.

Anders
    I just finished a crawler for a registered-users-only website (we were very thorough to make sure we didn't violate any ToS and were accessing our own data only). Anyway, all that to say: not only is it possible for someone to make a crawler for your registered-users-only service, but they will probably have to dig through your code to do it. When they do, they *will* be judging you for your **absolutely terrible** coding practices :p – Conor Mancone Aug 27 '19 at 13:19