I have a server hosting an intranet web site where one of the features will be the ability to upload files. The files will be saved in a restricted access folder and managed through the web back-end. To avoid name collisions, I plan on assigning UUIDs and storing the original filename along with the UUID in a database for future retrieval.

However, I do have 2 concerns:

  1. The possibility of duplicate files (at the actual byte level, not just by name), and
  2. Ensuring file integrity.

I thought that running some type of hash/checksum (MD5, SHA-256, etc.) could address both concerns. I could store the hash, recompute it at a future date to verify that the file had not been corrupted, and, if I found another file with the same hash, I would know that the file was a true duplicate.

So my questions are:

  1. Are my concerns about file corruption unfounded?
  2. Also, is this a good strategy for identifying duplicate files?
Big_Al_Tx
  • `1.` Editing a response into your question is not the right approach `2.` Nope! That's our way of saying we don't want questions about *AMP stacks. They're not relevant to systems administration, and should not be used as production systems. They are a convenience for developers, not something a sysadmin should be dealing with. – HopelessN00b Feb 26 '15 at 14:27
  • Then why does the message about this being a 'duplicate' say to 'please **edit** this question to explain how it is different'??? – Big_Al_Tx Feb 26 '15 at 14:38
  • Also, the question is not about the LAMP itself -- it's about managing files uploaded to the server. The LAMP is just what I'm using as the mechanism for uploading the files. – Big_Al_Tx Feb 26 '15 at 14:45
  • I don't see anything about editing the question to explain how it's different, but regardless, the proper solution is to edit your question to **make** it different. As it stands, your question is about development of an *AMP stack, which makes it off topic for two reasons (and the four users before me voted to close it for being a development question, FWIW). To get it reopened (and possibly even get useful answers), you need to address both those problems. So, remove the LAMP thing and just ask about data integrity and deduplication of user uploaded files to an Apache webserver with MySQL. – HopelessN00b Feb 26 '15 at 14:55
  • Well, being off-topic is also completely different than being a 'duplicate' question ... – Big_Al_Tx Feb 26 '15 at 15:16

1 Answer

1) File corruption is not common, and the underlying system should prevent it and warn you about it, but yes, it's nice to double-check. Better yet, keep a backup off site: http://en.wikipedia.org/wiki/Comparison_of_backup_software
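As a rough illustration of the double check, here is a minimal Python sketch. It assumes SHA-256 as the hash and that the hex digest was stored in the database at upload time; the function names `sha256_of_file` and `is_intact` are made up for this example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that large uploads do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, stored_hex_digest):
    """Compare the file's current digest with the one recorded at upload time.
    `stored_hex_digest` is assumed to come from your own database."""
    return sha256_of_file(path) == stored_hex_digest.lower()
```

You would call `sha256_of_file` once when the upload is saved, store the result next to the UUID, and run `is_intact` from a periodic job. A mismatch only tells you that the bytes changed, not why, so you still want the backup to restore from.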

2) If you're using hashes anyway, there is no need for another strategy. That said, there are things like rsync move detection that first compare all files by size, which is nice and fast; any files of the same size are then hashed (if not already) and checked for uniqueness. Depending on the file content, there are other options, like git for text, or quality trumping for media.
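The size-first approach can be sketched in Python roughly as follows. This is not how rsync itself is implemented, just the same idea under a few assumptions: all uploads live under one directory, SHA-256 is the hash, and `find_duplicates` is a hypothetical helper name.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1024 * 1024):
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return a list of groups (lists of paths) whose contents hash identically."""
    # Pass 1: group by size, which needs only a stat() per file.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot be a duplicate
        for path in paths:
            by_hash[(size, file_sha256(path))].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

If the hash is already stored in the database at upload time, the second pass reduces to a lookup on the hash column. MD5 has known practical collision attacks, so if users could deliberately craft colliding files, SHA-256 is the safer choice of the two.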

user1133275