
Here's my problem: I need to archive a lot of big files (up to 60 TB in total, usually 30 to 40 GB each) to tar files. I would like to make checksums (md5, sha1, whatever) of these files before archiving; however, not reading every file twice (once for checksumming, once for tar'ing) is more or less a necessity to achieve a very high archiving performance (LTO-4 wants 120 MB/s sustained, and the backup window is limited).

So I'd need some way to read a file, feeding a checksumming tool on one side and building a tar to tape on the other side, something along the lines of:

tar cf - files | tee tarfile.tar | md5sum -

Except that I don't want the checksum of the whole archive (which is what this sample shell code computes) but a checksum for each individual file in the archive.

I've studied the GNU tar, Pax and Star options. I've looked at the source of Archive::Tar. I see no obvious way to achieve this. It looks like I'll have to hand-build something in C or similar to achieve what I need. Perl/Python/etc simply won't cut it performance-wise, and the various tar programs lack the necessary "plugin architecture". Does anyone know of any existing solution to this before I start code-churning?
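
For illustration, here is a minimal sketch of the single-pass behaviour I'm after, written with Python's tarfile and hashlib purely to show the plumbing (as said, an interpreted loop probably won't sustain 120 MB/s); the HashingReader wrapper, the device path and the file names are hypothetical:

import hashlib
import tarfile

class HashingReader(object):
    ''' file-like wrapper: updates a digest with everything read through it '''
    def __init__(self, fileobj, digest):
        self.fileobj = fileobj
        self.digest = digest

    def read(self, size=-1):
        data = self.fileobj.read(size)
        self.digest.update(data)
        return data

# "w|" writes the archive as a non-seekable stream, e.g. to a tape device
tar = tarfile.open("/dev/st0", "w|")
for name in ["file1", "file2"]:                   # hypothetical input files
    info = tar.gettarinfo(name)
    digest = hashlib.md5()
    f = open(name, "rb")
    tar.addfile(info, HashingReader(f, digest))   # one read feeds both sides
    f.close()
    print("%s  %s" % (digest.hexdigest(), name))
tar.close()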

wazoox
    Certainly sounds like a useful addition to `tar` if you decide to write it ;) –  Mar 09 '10 at 00:09
    Not your question, but with `7z` you can choose the hash and print it in a way that `sha1sum` and `sha256sum` can understand: http://7zip.bugaco.com/7zip/7zip_15_09/MANUAL/cmdline/commands/hash.htm (and http://www.sami-lehtinen.net/blog/using-7-zip-hashing-to-compare-directories-and-files ) Try it: `7z h -scrcsha256 mydir/* | sed --regexp-extended 's, +[0-9]+ +, ,g' > mydir.sha256sum ; sha256sum -c mydir.sha256sum` (tested with p7zip Version 15.09 beta) – Nemo Dec 18 '15 at 12:12

4 Answers


Before going ahead and rewriting tar, you may want to profile the quick-and-easy method of reading the data twice, as it may not be much slower than doing it in one pass.

The two-pass method is implemented here:

http://www.g-loaded.eu/2007/12/01/veritar-verify-checksums-of-files-within-a-tar-archive/

with the one-liner (tar's -v output supplies the file names, and the test -f skips directory entries):

  tar -cvpf mybackup.tar myfiles/ | xargs -I '{}' sh -c "test -f '{}' && md5sum '{}'" | tee mybackup.md5

While it's true that md5sum is reading each file from disk in parallel with tar, instead of getting the data streamed through the pipe, Linux disk caching should make this second read a simple read from a memory buffer, which shouldn't really be slower than a stdin read. You just need to make sure you have enough space in your disk cache to store enough of each file that the second reader is always reading from the cache and not falling far enough behind that it has to retrieve from disk.
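
If you'd rather avoid the xargs quoting, the same approach can be expressed in a few lines of Python; a rough sketch, assuming tar's -v output prints one file name per line (file names as in the one-liner above):

import hashlib
import os
import subprocess

# same idea as the one-liner: consume tar's verbose output and md5 each
# file right after tar has written it, while it should still be cached
tar = subprocess.Popen(["tar", "-cvpf", "mybackup.tar", "myfiles/"],
                       stdout=subprocess.PIPE, universal_newlines=True)
sums = open("mybackup.md5", "w")
for line in tar.stdout:
    name = line.rstrip("\n")
    if not os.path.isfile(name):        # skip directories, like test -f
        continue
    d = hashlib.md5()
    f = open(name, "rb")
    while True:
        chunk = f.read(1024 * 1024)     # 1 MB at a time, never the whole file
        if not chunk:
            break
        d.update(chunk)
    f.close()
    sums.write("%s  %s\n" % (d.hexdigest(), name))
sums.close()
tar.wait()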

bk.
    It actually works quite fine; it looks limited by the CPU's ability to crunch md5 (~280 MB/s on one core). – wazoox Mar 09 '10 at 14:45

Here's an example Python script. It calculates the checksum of each file as it's being added to the archive. At the end of the script, the checksum file is added to the archive.

import hashlib
import os
import tarfile

def md5(filename):
    ''' md5 of a file (note: reads the whole file into memory) '''
    d = hashlib.md5()
    try:
        d.update(open(filename, 'rb').read())
    except Exception as e:
        print(e)
    else:
        return d.hexdigest()

root = "/home"
path = os.path.join(root, "path1")
outtar = os.path.join(root, "path1", "output.tar")
chksum_file = os.path.join(root, "path1", "chksum.txt")

tar = tarfile.open(outtar, "w")
o_chksum = open(chksum_file, "w")
for r, d, f in os.walk(path):
    for files in f:
        filename = os.path.join(r, files)
        if filename in (outtar, chksum_file):   # don't archive our own output
            continue
        digest = "%s:%s" % (md5(filename), filename)
        o_chksum.write(digest + "\n")           # record the checksum...
        tar.add(filename)                       # ...then add the file

o_chksum.close()        # flush the checksum file before archiving it
tar.add(chksum_file)
tar.close()

When you untar, use the chksum_file to verify the checksums.
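
For instance, a minimal verification sketch along those lines, assuming chksum.txt holds one "digest:filename" line per file as written above (paths may need adjusting depending on where you extract):

import hashlib

def md5(filename):
    ''' md5 of a file, read in 1 MB chunks to keep memory use flat '''
    d = hashlib.md5()
    f = open(filename, "rb")
    while True:
        chunk = f.read(1024 * 1024)
        if not chunk:
            break
        d.update(chunk)
    f.close()
    return d.hexdigest()

for line in open("chksum.txt"):
    digest, filename = line.rstrip("\n").split(":", 1)
    status = "OK" if md5(filename) == digest else "MISMATCH"
    print("%s %s" % (status, filename))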

user37841
    Yes, that's the kind of thing I thought about, but usually this sort of library loads the file into RAM before manipulating it, and my files are at least 20 GB.... – wazoox Mar 08 '10 at 13:38

I think your problem is a design issue of tar: tar does not allow random access/positioning inside the archive file via a table of contents, so all access is file-based rather than buffer-based.
Thus you may want to look at different formats like PAX or DAR, which allow random access.

weismat

Recent archive formats generally include some hash for file verification, but they have a similar issue: you can't always choose your own hash function, nor keep a local copy of the hashes.

You might want to save a local copy of the hashes, distinct from the one embedded in the archive itself: for instance, if the archive is stored offline (on tape, or in a data centre that is expensive to read from) and you want to verify a local copy of a file/directory.

7zip has several options, like 7z h with a custom hash and 7z l -slt to list all hashes and whatnot, but what if you want a list of md5 or sha1 hashes? You can use -bb and -bs to control verbosity and reuse the George Notaras method mentioned in the accepted answer (with -bb3, 7z logs each added file on a line starting with "+ ", which the grep below picks out):

7z a -bsp1 -bb3 dir.7z dir 2>&1 \
| grep "^+" | sed 's,^+ ,,g' | xargs -d "\n" -I § -P 1 sh -c "test -f '§' && sha1sum '§'" \
| tee dir.sha1
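
Since tee captures sha1sum's own output, dir.sha1 ends up in the standard "hash  filename" format, so the copy can later be verified with sha1sum -c dir.sha1 (the same pattern as the sha256sum -c example in the comments above).
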
Nemo