
I'm downloading a large file (1.2 TB) over HTTP via wget. The download takes about a week and has arrived corrupted twice now (failed MD5 check, which itself takes days to run).

Is there a good way to validate the file piecemeal over HTTP using, say, curl? Or to break it into separate blocks so that I could identify a specific bad block and redownload just that section?

The file is a tar archive, so I believe corruption in each block could be identified sequentially during unpacking.
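As a rough check (the filename here is just a placeholder), listing the archive without extracting it should at least point at roughly where it first becomes unreadable:

# Walk the archive sequentially; tar reports an error at the first
# header it cannot parse (corruption confined to the file data between
# headers can still slip through unnoticed).
tar -tvf somelargetarfile.tar > /dev/null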

davidparks21
  • Drop the file onto a hard disk and mail it - that has got to be quicker. Or use a protocol that will detect the issue and fix the problem before you know about it. – user9517 Oct 24 '17 at 21:03
  • Use rsync instead - it can redownload only the broken pieces automagically. – Martin Schröder Oct 26 '17 at 09:35
  • @MartinSchröder rsync is nice and would have been the primary tool to begin with, but it does require SSH access to the remote server. If he only has HTTP, that isn't possible. – Tonny Oct 26 '17 at 10:54
  • Right, I do not control the server in any way shape or form. rsync would be ideal but it's certainly not possible in this case. – davidparks21 Oct 27 '17 at 17:46
  • How poor is your storage that an md5sum takes *days*? – John Mahowald Oct 28 '17 at 22:38

3 Answers


On the server side, you can use dd and md5sum to checksum each chunk of the file:

#!/bin/bash
FILENAME="$1"
FILESIZE=$(stat --printf="%s" "$FILENAME")
CHUNKSIZE=536870912 # 512MB
CHUNKNUM=0
# Hash one chunk at a time until the whole file has been covered
while [ $(( CHUNKNUM * CHUNKSIZE )) -lt "$FILESIZE" ]; do
    dd if="$FILENAME" bs="$CHUNKSIZE" skip="$CHUNKNUM" count=1 2> /dev/null | md5sum >> "$FILENAME.md5"
    CHUNKNUM=$(( CHUNKNUM + 1 ))
done

You will be left with a single $FILENAME.md5 file containing all the chunk hashes, one per line, in order.

You can now download that large file along with the checksums, run the same script on the downloaded file, and compare the hashes. If any piece gets a mismatched hash, you can use curl to download just that part of the file (if the server supports HTTP range requests) and patch the file in place with dd.
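For instance, assuming both sides ran the same script and the server's hash list was saved next to your download (the file names below are placeholders), the bad chunks can be listed with a one-liner:

# Each .md5 file holds one hash per chunk, in order, so a mismatch on
# line N corresponds to 0-based chunk number N-1.
paste somelargetarfile.tar.md5 server.md5 | awk '$1 != $3 {print "chunk", NR-1, "is corrupt"}'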

For example, if the second chunk (bytes 536870912 through 1073741823) gets a hash mismatch:

curl -s -r 536870912-1073741823 "http://example.com/somelargetarfile.tar" | dd of=somelargetarfile.tar bs=536870912 seek=1 conv=notrunc

This downloads only that chunk (the URL is a placeholder for the real download location) and patches the large tar file with it. Note that dd's seek= counts in blocks of bs= bytes, which is why the block size must match the chunk size here.
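More generally, for the corrupt chunk with 0-based number N the offsets follow from the same chunk size (URL again a placeholder):

N=1                                # 0-based chunk number; 1 is the second chunk
CHUNKSIZE=536870912
START=$(( N * CHUNKSIZE ))
END=$(( START + CHUNKSIZE - 1 ))   # HTTP ranges are inclusive
curl -s -r "$START-$END" "http://example.com/somelargetarfile.tar" | dd of=somelargetarfile.tar bs="$CHUNKSIZE" seek="$N" conv=notrunc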

ThoriumBR

ThoriumBR's answer is good, but I would like to add some advice in case you can't access the remote server.

You already have one (or more) bad download(s) locally.
Using the chunking trick given by ThoriumBR, you can split those files locally into the same chunks and make use of the good parts.
Compare each of those chunks with the same chunk downloaded using curl (as per ThoriumBR's last instruction). If you end up with two identical chunks (a binary diff, no need for a slow md5), you can be relatively certain that it's a good chunk, so save it somewhere else and repeat with the next chunk.
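For example (the file names are placeholders), a given 512MB chunk can be pulled out of two bad downloads and compared byte for byte:

# Extract chunk number 2 (0-based) from both copies and compare them;
# cmp is silent and returns 0 when the chunks are identical.
dd if=download1.tar bs=536870912 skip=2 count=1 2> /dev/null > chunk2.a
dd if=download2.tar bs=536870912 skip=2 count=1 2> /dev/null > chunk2.b
cmp chunk2.a chunk2.b && echo "chunk 2 matches, keep it"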

So, for each chunk: compare your local copies (if you have more than one), add freshly downloaded copies, and keep comparing until you find two identical chunks; that is the one to keep.

It is a fair bit of manual work, but doable. You can even script the whole process, but doing that (and debugging the script) may not be worth the effort.

Tonny
  • That still means downloading the whole file again, and there's no reason to expect the freshly downloaded chunk won't be corrupted. – GnP Nov 01 '17 at 13:04
  • @GnP True, but if the OP only has HTTP access to the remote server I don't really see any other way. – Tonny Nov 01 '17 at 14:20

On the source server, create a BitTorrent .torrent and add the existing location as a web seed URL. BitTorrent will verify the chunks. Any client that manages to download a copy can seed it, if desired.
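One way to do that, as a sketch assuming the mktorrent tool is available (the tracker and web seed URLs below are placeholders):

# Create a .torrent whose pieces are hashed from the known-good copy and
# which lists the existing HTTP location as a web seed.
mktorrent -a udp://tracker.example.com:1337 \
          -w http://example.com/somelargetarfile.tar \
          -o somelargetarfile.torrent \
          somelargetarfile.tar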

This does require a good copy of the file to create the .torrent. Very similar to ThoriumBR's solution, with different tools.

If you still have the failed files and/or their checksums, compare every one. Getting the same result each time could indicate that your transfer is correct but that the remote file disagrees with its known checksum.
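For instance (the names of the failed attempts are hypothetical), hashing both failed downloads side by side makes that comparison explicit:

# Identical output on both lines means both transfers produced the same
# data, pointing the finger at the remote file rather than the transfer.
md5sum attempt1.tar attempt2.tar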

John Mahowald