So I have some data that isn't overly sensitive, but I'm still on the fence about whether we should invest the additional time in managing it as a private resource versus just leaving it publicly available.
The data (images & PDFs) will be hosted in an AWS S3 bucket. It feeds a private web application with a very limited number of users. These users (or the admin) can upload data to the bucket via the web app (or the AWS console). The current beta version serves that data the same way as the website's static files, i.e. publicly available to anyone who knows the resource's URL.
Since the application is aimed at a very restricted audience, the only way one could obtain those URLs (that I can think of) would be a web crawler that generates random variations on S3's generic URL pattern (e.g. https://<bucketname>.s3.amazonaws.com/<folder_structure>/<filename>). I guess by generating enough of those, one could build a database of random files, including those images/PDFs, and then maybe use some classification/AI/human processing to sort them into something actually useful. If I were to build such a crawler, I guess I would start by generating bucket names, figure out from the responses whether each one corresponds to an actual bucket, and then eventually refine the guesses to find filenames (a rough sketch of that first step is below). It seems it would take quite a number of guesses to get that working.
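To make concrete what I mean by "figure out from the responses", here is a rough sketch of that first step. The candidate bucket names are made up; the only thing relied on is that S3 answers unauthenticated requests with different HTTP status codes depending on whether the bucket exists:

```python
# Rough sketch of the unauthenticated probing described above.
# The candidate names are hypothetical; only S3's HTTP status codes are used:
# 404 -> no such bucket, 403 -> bucket exists but access is denied,
# 200 -> bucket exists and its listing is publicly readable.
import requests

CANDIDATE_NAMES = ["acme-prod-assets", "acme-beta-uploads"]  # hypothetical guesses

def probe_bucket(name: str) -> str:
    resp = requests.head(f"https://{name}.s3.amazonaws.com", timeout=5)
    if resp.status_code == 404:
        return "does not exist"
    if resp.status_code == 403:
        return "exists (listing denied)"
    if resp.status_code == 200:
        return "exists (publicly listable)"
    return f"unclear (HTTP {resp.status_code})"

for name in CANDIDATE_NAMES:
    print(name, "->", probe_bucket(name))
```

Once a bucket name is confirmed, the same kind of guessing could in principle be repeated against <folder_structure>/<filename>, which is the part that worries me.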
Am I missing an obvious flaw here? Is what I describe above actually pretty commonplace, meaning that sooner or later the data will find its way into some random database? More generally... how does one draw the line between what should be covered by an enforced access policy and the default "security by obscurity" that random URLs "provide"?
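For context, the "enforced access policy" alternative I'm weighing is keeping the bucket private and having the web app hand out short-lived pre-signed URLs instead of permanent public ones. Something along these lines (the bucket and key names are placeholders):

```python
# Sketch of the alternative I'm considering: the bucket stays private and the
# web app issues time-limited pre-signed URLs. Bucket/key names are placeholders.
import boto3

s3 = boto3.client("s3")

def signed_url_for(key: str, expires_seconds: int = 3600) -> str:
    # The returned URL is only valid for `expires_seconds`; after that the
    # object is unreachable again without fresh credentials.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-private-bucket", "Key": key},
        ExpiresIn=expires_seconds,
    )

print(signed_url_for("reports/2024/summary.pdf"))
```

That's the extra plumbing I'm not sure is worth it for data of this sensitivity, hence the question.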