Which archive file formats provide recovery protection against file corruption?

10

4

I use my external HDD to back up my files by putting them into large archive files.

I have thousands of tiny files, and put them into archives of 500 MB to 4.2 GB in size before sending them to the external HDD. But does one hard disk failure destroy the whole archive, or only one file in the archive? I fear that one flipped bit could render large parts of the archive useless.

Things like CRC checks can alert you to the existence of corruption, but I am more interested in the ability to recover the undamaged files from a corrupted archive. What archive file formats would provide the best ability to recover from such failures, either through the native design of the archive structure or the existence of supplementary recovery tools? Is there any difference in this capability between zip and iso files?

sevenkul

Posted 2014-03-28T07:52:10.237

Reputation: 211

Please reopen the question. I have reworded it, and it should be clearer now. "Best" will always be somewhat opinion-based, but the requirements for "best" here are quite clear; little room for personal opinions IMHO. Please delete this comment after reopening. – Marcel – 2015-02-11T13:03:11.643

I know at least one of the programs I use for file synchronization supports multithreaded copying, which I believe mitigates some of the slowness of copying lots of small files; also, though I would have to test to be sure, I have a suspicion that creating an archive of lots of small files would also take longer than creating an archive for several large files, even if no compression is used. I don't remember if this is a Windows-only issue or not, though; iirc, there are some software solutions available for Linux that can handle lots of small files in blocks, but I can't recall the details. – JAB – 2014-03-28T14:01:33.547

Answers

8

Given that damage to the directory portion of an archive can render the entire archive useless, your best bet is to add a separate step to your backup process that generates so-called parity files. If a data block in the original file gets damaged, it can be reconstructed by combining data from the parity file with the remaining valid blocks of the original file.

The variable here is how much damage you want to be able to repair. A single parity bit is only enough to detect a single flipped bit; actually repairing errors requires more redundancy, and protecting against damage on the order of a disk sector obviously costs more still.

There's a large body of theory behind this (see Forward Error Correction), and it is widely used in practice. For example, it is how CDs can withstand a certain degree of scratching and how cell phones maintain reasonable call quality over lossy connections.

Long story short, take a look at .par files.
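
As a minimal sketch using the par2 command-line tool (the archive name and the 10% redundancy level below are just placeholders to adjust to your needs):

par2 create -r10 archive.7z.par2 archive.7z   # write recovery data worth roughly 10% of the archive
par2 verify archive.7z.par2                   # later: check the archive against the recovery data
par2 repair archive.7z.par2                   # reconstruct damaged blocks if verification fails

Keep the generated .par2 files next to the archive on the external drive; they can only repair damage if they are available at restore time.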

Angstrom

Posted 2014-03-28T07:52:10.237

Reputation: 610

Thanks, while searching for parity files I found WinRAR's recovery record simpler for daily use. I will also try QuickPar. – sevenkul – 2014-03-28T13:11:50.720

One bit of error correction data is not sufficient to repair a one-bit error in your n-bit data file. You could detect such an error with a single bit, but to repair it, you need at least log n bits. – Thom Smith – 2014-03-28T14:53:27.450

4

Bup [1] backs up your files and automatically adds parity redundancy, making bit rot extremely unlikely. Catastrophic disk failure is still a possibility, so it is worth combining it with git-annex.

git-annex [2] manages files stored across many repositories, which may live on your computer, on thumb drives, behind an ssh login, on various cloud services, or in a bup backup repository [3], letting file data flow, on request or automatically, into whichever repositories you have configured (a rough usage sketch follows the links below). It is also a crowd-funded free and open-source project written in Haskell, with versions running on many platforms, including Linux, Mac, Windows, and Android.

[1] https://github.com/bup/bup

[2] http://git-annex.branchable.com/

[3] http://git-annex.branchable.com/special_remotes/bup/
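
As a rough sketch of how the two fit together (the directory and remote names below are made up, and bup's recovery-block generation assumes par2 is installed):

# bup: index and save a directory, then generate par2 recovery blocks
bup init
bup index ~/files
bup save -n mybackup ~/files
bup fsck -g

# git-annex: inside an existing git repository, track large files
# and copy them to another configured repository
git annex init "laptop"
git annex add bigfiles/
git annex copy --to externaldrive bigfiles/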

Yuval Langer

Posted 2014-03-28T07:52:10.237

Reputation: 231

3

But does one hard disk failure destroy the whole archive, or only one file in the archive?

If there is really no alternative to copying everything as one big archive, you have to decide between using a compressed or an uncompressed archive.

The contents of uncompressed archives like tarballs can still be detected with file recovery software even if the archive file itself can no longer be read (e.g. due to a corrupt header).

Using compressed archives can be dangerous because some tools refuse to extract files after a checksum error, which can be triggered by a single flipped bit in the archive file.

Of course, you can minimize the risk by not storing hundreds of files in one compressed archive, but instead storing hundreds of individually compressed files in one uncompressed archive:

gzip *
tar cf archive.tar *.gz

That said, I have never seen lots of gzipped files inside a tarball in the wild; only the opposite is popular (i.e. .tar.gz files).
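
A nice side effect of compressing each file separately is that you can test every member on its own and see exactly which ones, if any, were damaged, for example:

tar xf archive.tar    # unpack the individually gzipped files
gzip -tv *.gz         # test each one; only damaged members report an error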

Is there any difference in this capability between zip and iso files?

ZIP is a (usually, but not necessarily) compressed archive format, whereas ISO denotes a raw, low-level copy of an optical disc's data into a file. The latter can contain literally anything.
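
If a ZIP archive does get damaged, the Info-ZIP tools can at least help locate and salvage the intact members (the archive names below are placeholders, and how much is recoverable depends on where the damage sits):

unzip -t backup.zip                     # list members that fail their CRC check
zip -FF backup.zip --out repaired.zip   # try to rebuild the archive structure
unzip repaired.zip                      # extract whatever survived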

dulange

Posted 2014-03-28T07:52:10.237

Reputation: 204