md5sum on large files


Context:

I have a large terabyte drive with various types of large media files, ISO images, etc. I would like to verify its contents by running md5sum on just the first megabyte of each file, for speed/performance.

You can create a sum like this:

FILE=four_gig_file.iso
# hash only the first megabyte; md5sum prints "<hash>  -" when reading stdin
SUM=$(head -c 1M "$FILE" | md5sum)
# keep just the hash and write a "<hash> *<file>" line to the manifest
printf '%s *%s\n' "${SUM%% *}" "$FILE" >> test.md5

How would you verify this, given that the first megabyte's signature is different from the whole file's? Plain md5sum -c re-reads each file in full, so it will always report a mismatch against these partial sums.

I've seen this done in other languages, but I am wondering how to do it in Bash. I've experimented with various md5sum -c permutations involving pipes and whatnot.


Instead of using md5sum -c, would you have to recompute the hashes into a new file, then 'diff' them?
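
One idea along those lines: instead of diffing two manifests, read test.md5 back and recompute each first-megabyte hash directly. A minimal, untested sketch, assuming the "<hash> *<file>" format written above:

# re-read test.md5 and recompute the first-megabyte hash of each file
while read -r sum file; do
    file=${file#\*}                      # strip the leading "*" marker
    calc=$(head -c 1M "$file" | md5sum)  # "<hash>  -"
    if [ "${calc%% *}" = "$sum" ]; then
        echo "$file: OK"
    else
        echo "$file: FAILED"
    fi
done < test.md5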

You can use a

find /directory/path/ -type f -print0 | xargs -0 md5sum blah blah

to work on a large number of files.

PS: Rsync is not an option

UPDATE 2: So as it stands --

Using head, find, and md5sum, one could create a manifest from the source directory fairly quickly, then recompute the same thing on the destination and check the two with diff. Are there clever one-liners or scripts for this?
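
The direction I have in mind, as a rough sketch (/source/path is a placeholder, and it assumes identical relative paths on both machines):

# on the source: one "<hash>  <path>" line per file, sorted by path
find /source/path -type f -print0 |
while IFS= read -r -d '' f; do
    sum=$(head -c 1M "$f" | md5sum)
    printf '%s  %s\n' "${sum%% *}" "$f"
done | sort -k2 > source.md5

# run the same pipeline on the destination into dest.md5, then:
diff source.md5 dest.md5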

Bubnoff

Posted 2011-07-29T20:30:16.880

Reputation: 277

Answers

Verifying contents by sampling only the first megabyte of each file will likely not detect larger files that have been corrupted, damaged, or otherwise altered. You're giving the hashing algorithm only one megabyte of data when there might be hundreds of other megabytes that could be off. Even a single bit in the wrong position gives a different signature, but only if that bit falls within the data that is actually hashed.

If data integrity is what you want to verify, you're better off with the CRC32 algorithm. It's faster than MD5. Although it is possible to deliberately forge or modify a file so that it still shows the correct CRC32 signature, random corruption is very unlikely to ever do that.
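
For instance, CRC-ing just the first megabyte works the same way as with md5sum (a sketch, assuming GNU coreutils):

# cksum (CRC) of the first megabyte only; prints "<crc> <byte count>"
head -c 1M four_gig_file.iso | cksum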

Update:

Here's a nice one-liner that computes the 1-megabyte-based md5 checksum of every file:

find ./ -type f -print0 | xargs -0 -n1 -I{} sh -c 'echo "$1" >> output.md5 && head -c 1M "$1" | md5sum >> output.md5' sh {}

Replace md5sum with cksum if you feel like it. Notice that I chose to include the filename in the output; that's because md5sum only prints - in place of the filename when it reads from a pipe rather than from a named file.
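
To check the other side, one option is to regenerate the manifest on the destination and diff the two. A sketch, assuming the same relative paths on both machines:

# on the destination: rebuild the same manifest, then compare
find ./ -type f -print0 | xargs -0 -n1 -I{} sh -c 'echo "$1" >> dest.md5 && head -c 1M "$1" | md5sum >> dest.md5' sh {}
diff output.md5 dest.md5
# caveat: this assumes find walks the tree in the same order on both machines;
# otherwise normalize the manifests (one line per file, sorted) before diffing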

jesper

Posted 2011-07-29T20:30:16.880

Reputation: 361

Keep in mind that there is an overhead caused by repeatedly launching the required utilities for each file, which would slow things down. A python script or similar would be a good option for this. – tripflag – 2015-10-02T10:58:22.723

Could using tail instead of head be an improvement, since the beginning of a file is less likely to be corrupted than the end? – Sridhar Sarnobat – 2017-01-04T00:05:24.677

There is that risk, yes. The issue is the speed and performance of md5sum across a large capacity drive containing very large files. To checksum the entire file is not practical. Thank you for the CRC32 tip, I will definitely check into that. – Bubnoff – 2011-07-29T21:15:30.150

Does the Linux command cksum use CRC32? Both man and info say 'CRC' but not CRC32. – Bubnoff – 2011-07-29T21:26:21.333

Yes, cksum uses CRC32. There are a few variations of CRC32 (or CRC) which could be confusing if you were to compare checksums from other tools. Check out the wikipedia article for more details: http://en.wikipedia.org/wiki/Cyclic_redundancy_check

– jesper – 2011-07-29T21:40:53.890

Thanks! Good to know. I'll do some tests and weigh options from there. Since this is mostly to test the presence/absence of a file, 1 MB might be an OK risk as far as corruption goes. If CRC32 is fast enough, though, maybe I don't need to take that risk. – Bubnoff – 2011-07-29T21:44:29.263

I used time with both md5sum and cksum; it showed that cksum is slower than md5sum. – Bubnoff – 2011-07-29T21:56:05.763

I just tried it on a 3 gigabyte file and they took approximately the same amount of time. If you really want to just check the presence of files, it's fairly easy to do so. I'll update my answer. Then you should probably do a long crunch to verify the integrity afterwards. – jesper – 2011-07-29T22:04:28.410
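
For example, a presence-only check could skip hashing entirely (a minimal sketch; /source and /dest are placeholder paths):

# compare sorted file lists only; no hashing
( cd /source && find . -type f | sort ) > source.list
( cd /dest   && find . -type f | sort ) > dest.list
diff source.list dest.list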

md5 and sha1 are likely to be I/O bound on any modern system, and have a much smaller chance of collision than crc32. – afrazier – 2011-07-29T23:36:25.390

Nice one-liner! Thinking I might script this up to handle a directory and output file. Thanks! – Bubnoff – 2011-07-30T01:46:42.813