Is there a file checksum designed specifically for recovering a single file (archive) with data corruption? Something simple like a hash that can be used to recover the file
I am trying to archive backups of home and business files (not media files) by compressing and dating them. The largest archive currently runs about 250 GB. After an archive is created, I compute an MD5 checksum on it, transfer the archive to another drive, use the MD5 to verify the transfer, and store the hash with the archive for future verification. I plan to archive these backups 1-2 times a year and store them on HDD and tape as budget allows.
Current archive format is "Zipx" with highest settings.
Given a volume of about 1-2 TB a year currently, I foresee having some data corruption to deal with, especially since these files sit on consumer drives. Add in that backups get transferred from drive to drive, to tape, and back again, so an initial 250 GB archive can actually represent many terabytes of written and read data, increasing the risk of corruption. Verifying MD5s after each transfer also adds a lot of time: the check is I/O-limited, an MD5 run on a 250 GB archive takes a long time multiplied across all the archives, and the hashes are bound not to get checked as often as they should be.
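For the verification step itself, a streamed hash keeps memory use flat and runs as fast as the drive can deliver data. A minimal Python sketch (the function name and chunk size are my own choices, not from any particular tool):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 in 1 MiB chunks so huge archives
    never have to fit in memory; returns the hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing `md5_of_file(...)` before and after a transfer gives the same detect-only guarantee as storing `.md5` files alongside the archives.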
So the assumptions are:
- Data will get corrupted
- We will not know about it until after the fact.
- Due to budget restrictions and the lack of "mission critical", we do not have multiple copies of the exact same backup archives, only different iterations of backups.
- We want to minimize the copies of our backups while protecting against data corruption.
- If a file or two in an archive does get corrupted and we lose the data when we try to restore, life will go on. This is not mission critical.
- The archives are a secondary backup and will hopefully be used no more than a couple of times a decade. A live, uncompressed backup exists.
With these assumptions, how do we protect against data corruption?
Storing the MD5 hash only tells you whether the current data matches the original; it does nothing to help repair the data. That is, if I need to restore from backup and there is corruption in the file or files I need, the MD5 is effectively useless.
So is there a checksum that is specifically designed to not only verify data but repair it as well? Kind of like ECC for memory but for files?
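To illustrate the idea with a toy example: this is single-block XOR parity (the RAID-5 trick), not the Reed-Solomon coding that real tools like Parchive use, which can repair many missing blocks rather than just one. The point is that a small amount of stored redundancy, unlike a plain hash, lets you rebuild lost data:

```python
def make_parity(blocks):
    """XOR equal-sized blocks together; the result can rebuild
    any ONE lost block from the survivors."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover_block(surviving_blocks, parity):
    """XOR the parity with every surviving block; what remains
    is the single missing block."""
    missing = bytearray(parity)
    for block in surviving_blocks:
        for i, b in enumerate(block):
            missing[i] ^= b
    return bytes(missing)
```

A real repair tool does the same thing at scale: it stores extra code blocks computed from the archive, and corrupted or missing pieces are reconstructed from the pieces that survived.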
Note: I did find parchive, but it does not seem to be current or reliably usable. While I may not like how they implemented things, parchive is in general exactly what I am looking for but cannot find. Does something parchive-like exist that is "production" ready?
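For reference, if you do experiment with the par2cmdline implementation of Parchive, the workflow is roughly as follows (the archive name and redundancy percentage here are illustrative; check `par2 --help` for the exact switches in your build):

```shell
# Create PAR2 recovery files with ~10% redundancy for the archive
par2 create -r10 backup-2016.zipx.par2 backup-2016.zipx

# Later: verify the archive against the recovery data
par2 verify backup-2016.zipx.par2

# If corruption is found, attempt repair using the recovery blocks
par2 repair backup-2016.zipx.par2
```

The `.par2` files are separate from the archive, so this approach works with any container format, including Zipx.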
Update: It looks as though some archive formats do support recovery records, although the only mainstream one seems to be WinRAR. It would be preferable not to get locked into a format for this one feature, as most archiving formats (roughly 75% in the linked list) do not seem to support recovery.
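For completeness, WinRAR's recovery record is added at archive-creation time with the `-rr` switch, and repairs use the `r` command (switch syntax as I understand the rar manual; verify with `rar -?` before relying on it):

```shell
# Create an archive with a recovery record sized at ~5% of the archive
rar a -rr5% backup-2016.rar /path/to/files

# Later: repair a damaged archive using its embedded recovery record
rar r backup-2016.rar
```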
ECC adds redundancy, whilst compressors tend to reduce it to a minimum. A 1-bit error in a compressed file is likely to alter several files. When the MD5s differ, which copy is faulty? :) – levif – 2016-11-11T23:43:20.110
Kind of my point. Which is why I am looking for something that could rebuild the data outside of the archive, to avoid the archive-vs-raw-files issue. It would have to work at the bit level rather than the file level. It seems whatever it is would use Reed-Solomon error correction. But nothing I find seems to be user-friendly, simple, long-standing, and/or ready for use. Everything seems old or unsupported, complicated, etc. – Damon – 2016-11-12T00:58:03.233