So I have some data that isn't overly sensitive, but I'm still on the fence about whether we should invest the additional time in managing it as a private resource versus just leaving it publicly available.
The data (images & PDFs) will be hosted in an AWS S3 bucket. It feeds a private web application with a very limited number of users. These users (or the admin) can upload data to the bucket via the web app (or the AWS console). The current beta version serves that data the same way as the website's static files, i.e. publicly available to anyone who knows the resource's URL.
Since the application is aimed at a very restricted audience, the only way one could obtain those URLs (that I can think of) would be a web crawler that generates random variations on S3's generic URL pattern (e.g. https://<bucketname>.s3.amazonaws.com/<folder_structure>/<filename>). I guess by generating enough of those, one could build a database of random files, including those images/PDFs, and then maybe use some classification/AI/human processing to sort them into something actually useful. If I were to build such a crawler, I guess I would start by generating bucket names, figure out from the responses whether each one corresponds to an actual bucket, and then eventually refine the guesses to find filenames (a rough sketch of that first step is below). It seems it would take quite a number of guesses to get that working.
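To make concrete what I mean by "figure out from the responses", here is a rough sketch of that first step. The candidate bucket names are made up; the only thing relied on is that S3 answers unauthenticated requests with different HTTP status codes depending on whether the bucket exists:

```python
# Rough sketch of the unauthenticated probing described above.
# The candidate names are hypothetical; only S3's HTTP status codes are used:
# 404 -> no such bucket, 403 -> bucket exists but access is denied,
# 200 -> bucket exists and its listing is publicly readable.
import requests

CANDIDATE_NAMES = ["acme-prod-assets", "acme-beta-uploads"]  # hypothetical guesses

def probe_bucket(name: str) -> str:
    resp = requests.head(f"https://{name}.s3.amazonaws.com", timeout=5)
    if resp.status_code == 404:
        return "does not exist"
    if resp.status_code == 403:
        return "exists (listing denied)"
    if resp.status_code == 200:
        return "exists (publicly listable)"
    return f"unclear (HTTP {resp.status_code})"

for name in CANDIDATE_NAMES:
    print(name, "->", probe_bucket(name))
```

Once a bucket name is confirmed, the same kind of guessing could in principle be repeated against <folder_structure>/<filename>, which is the part that worries me.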
Am I missing an obvious flaw here? Is what I describe above actually pretty commonplace, meaning that sooner or later the data will find its way into some random database? More generally... how does one draw the line between what should be covered by an enforced access policy and the default "security by obscurity" that random URLs "provide"?
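For context, the "enforced access policy" alternative I'm weighing is keeping the bucket private and having the web app hand out short-lived pre-signed URLs instead of permanent public ones. Something along these lines (the bucket and key names are placeholders):

```python
# Sketch of the alternative I'm considering: the bucket stays private and the
# web app issues time-limited pre-signed URLs. Bucket/key names are placeholders.
import boto3

s3 = boto3.client("s3")

def signed_url_for(key: str, expires_seconds: int = 3600) -> str:
    # The returned URL is only valid for `expires_seconds`; after that the
    # object is unreachable again without fresh credentials.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-private-bucket", "Key": key},
        ExpiresIn=expires_seconds,
    )

print(signed_url_for("reports/2024/summary.pdf"))
```

That's the extra plumbing I'm not sure is worth it for data of this sensitivity, hence the question.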