I have a server hosting an intranet web site where one of the features will be the ability to upload files. The files will be saved in a restricted access folder and managed through the web back-end. To avoid name collisions, I plan on assigning UUIDs and storing the original filename along with the UUID in a database for future retrieval.
However, I do have two concerns:
- The possibility of duplicate files (at the actual byte level, not just by name), and
- Ensuring file integrity.
I thought that computing some type of hash / checksum (MD5, SHA-256, etc.) could address both concerns. I could store the hash alongside the UUID, recompute it at a future date to verify the file had not been corrupted, and if another uploaded file produced the same hash, I would know it was a true duplicate.
So my questions are:
- Are my concerns about file corruption unfounded?
- Also, is this a good strategy for identifying duplicate files?