Context:
I have a large terabyte drive with various types of large media files, ISO image files, etc. I would like to verify its contents using md5sum on only the first megabyte of each file, for speed/performance reasons.
You can create a sum like this:
FILE=four_gig_file.iso
SUM=$(head -c 1M "$FILE" | md5sum)
printf "%s *%s\n" "${SUM%% *}" "$FILE" >>test.md5
How would you verify this, given that the first megabyte's signature differs from the whole file's?
I've seen this done in other languages, but I am wondering how to do it in Bash. I've experimented with various md5sum -c permutations involving pipes and whatnot.
Instead of using md5sum -c, would you have to recompute the hashes into a new file, then 'diff' them?
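For illustration, here is a minimal sketch of what such a check could look like without md5sum -c. It assumes GNU head and md5sum, and manifest lines of the form "<md5> *<path>" as written by the printf above; the function name verify_partial is hypothetical, and file names with leading or trailing whitespace or newlines are not handled.

```bash
#!/usr/bin/env bash
# Sketch only: re-hash the first 1 MiB of every file listed in a manifest
# and compare against the stored digest. verify_partial is a made-up name.
verify_partial() {
    local manifest=$1 expected name actual status=0
    while read -r expected name; do
        name=${name#\*}                  # drop md5sum's binary-mode marker
        if [[ ! -f $name ]]; then
            printf '%s: MISSING\n' "$name"
            status=1
            continue
        fi
        actual=$(head -c 1M -- "$name" | md5sum)
        actual=${actual%% *}             # keep only the hex digest
        if [[ $actual == "$expected" ]]; then
            printf '%s: OK\n' "$name"
        else
            printf '%s: FAILED\n' "$name"
            status=1
        fi
    done < "$manifest"
    return "$status"                     # non-zero if anything failed
}
```

Called as verify_partial test.md5, it prints one line per file and returns non-zero if any file is missing or differs, roughly mirroring what md5sum -c reports for whole files.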
You can use something like
find /directory/path/ -type f -print0 | xargs -0 md5sum blah blah
to work on a large number of files.
PS: Rsync is not an option
UPDATE 2: So as it stands --
Using head, find, and md5sum, one could create a checksum file from the source directory fairly quickly, compute the same on the destination, and then compare the two with diff. Are there clever one-liners or scripts for this?
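One possible shape for this, as a sketch rather than a tested solution for every setup: a small function that prints a sorted manifest of first-megabyte hashes with paths relative to the directory, so the output from two machines can be diffed directly. It assumes GNU find, sort (for -z), head, and md5sum; the name partial_manifest and the paths in the usage note are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: print "<md5 of first 1 MiB> *<relative path>" for every
# regular file under a directory. partial_manifest is a made-up name.
partial_manifest() {
    local dir=$1
    (
        cd "$dir" || exit 1              # relative paths make manifests comparable
        find . -type f -print0 | sort -z |
        while IFS= read -r -d '' f; do
            sum=$(head -c 1M -- "$f" | md5sum)
            printf '%s *%s\n' "${sum%% *}" "$f"
        done
    )
}
```

Usage would be along the lines of partial_manifest /source/dir > src.md5 on one side, partial_manifest /dest/dir > dst.md5 on the other, then diff src.md5 dst.md5. Sorting the file list keeps the two manifests in the same order, so diff only reports real differences.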
Keep in mind that there is an overhead caused by repeatedly launching the required utilities for each file, which would slow things down. A python script or similar would be a good option for this. – tripflag – 2015-10-02T10:58:22.723
Could using tail instead of head be an improvement, since the beginning of a file is less likely to be corrupted than the end? – Sridhar Sarnobat – 2017-01-04T00:05:24.677
There is that risk, yes. The issue is the speed and performance of md5sum across a large capacity drive containing very large files. To checksum the entire file is not practical. Thank you for the CRC32 tip, I will definitely check into that. – Bubnoff – 2011-07-29T21:15:30.150
Does the Linux command cksum use CRC32? Both man and info say 'CRC' but not CRC32. – Bubnoff – 2011-07-29T21:26:21.333
Yes, cksum uses CRC32. There are a few variations of CRC32 (or CRC) which could be confusing if you were to compare checksums from other tools. Check out the Wikipedia article for more details: http://en.wikipedia.org/wiki/Cyclic_redundancy_check – jesper – 2011-07-29T21:40:53.890
I used time with both md5sum and cksum; my results show that cksum is slower than md5sum. – Bubnoff – 2011-07-29T21:56:05.763
I just tried it on a 3 gigabyte file and they took approximately the same amount of time. If you really want to just check the presence of files, it's fairly easy to do so. I'll update my answer. Then you should probably do a long crunch to verify the integrity afterwards. – jesper – 2011-07-29T22:04:28.410
md5 and sha1 are likely to be I/O bound on any modern system, and have a much smaller chance of collision than crc32. – afrazier – 2011-07-29T23:36:25.390
Nice one-liner! Thinking I might script this up to handle a directory and output file. Thanks! – Bubnoff – 2011-07-30T01:46:42.813