File copy program that generates checksums of the data while copying too

5

2

My question in short: is there a tool that copies a file from directory A to B while, simultaneously, generating the checksum of the file it has copied/read, without doing an extra read/pass just to generate the said checksum?

I will be copying a few TBs of files from one HDD to another and instead of:

  1. Copy files from HDD1 -> HDD2 (X hours)
  2. Generate checksums of files on HDD1 (Y hours)
  3. Verify checksums of files on HDD2 (~Y hours)

I was thinking of a more streamlined process:

i. Copy files from HDD1 -> HDD2 and generate checksums of the files copied as well (Z hours)

ii. Verify checksums of files on HDD2 (~Y hours)

My assumption is that Z ~= X because a program that can do this will already have read the complete file (as it's copying it from one HDD to another) and hence does not need to read the file again just to generate its checksum.
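To make the single-pass idea concrete, here is a minimal Python sketch of step i. It is only an illustration, not an existing tool: the function name `copy_with_checksum`, the 1 MiB chunk size and the choice of SHA-256 are all assumptions made for the example. Each chunk is read once, fed to the hash, and written to the destination.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary value chosen for this sketch


def copy_with_checksum(src, dst, algorithm="sha256", chunk_size=CHUNK_SIZE):
    """Copy src to dst chunk by chunk and return the hex digest of the bytes read.

    The source is read exactly once: every chunk updates the running hash
    and is then written to the destination.
    """
    digest = hashlib.new(algorithm)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            digest.update(chunk)
            fout.write(chunk)
    return digest.hexdigest()
```

The returned digest describes what was read from HDD1, so it can be stored and later compared against a checksum computed from the copy on HDD2 (step ii).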

I know this idea might not work if, for example, the OS uses DMA to copy the file, and I am not sure what technique Windows 7 uses to copy files from one HDD to another.

Any suggestions to this effect will be appreciated, especially ones that speed up the copying process and make sure the transfer is 1:1 without corruption or missing files.

PoorLuzer

Posted 2012-12-26T14:19:27.973

Reputation: 590

I am writing something that does just this. The whole file is typically never in memory, only parts of it. Loading 1 GB into memory is a bad idea because we don't know what resources each PC has, so we copy a few thousand bytes at a time. Thanks to multithreading, the checksum of the source file can be computed before, during or after the copy. The checksum of the destination can only be computed after the copy, but that can run while the next file is being copied. Anyway, most modern/decent backup programs do this as standard. So what is your question? – Dave – 2012-12-26T14:25:08.410

@Dave: We don't need to read the whole file into memory - blocks of it at a time will do. I have multi GB files that might not fit into memory if multiple files were loaded. I just don't want to read a file twice if it can be done once. Is your tool ready to try today? – PoorLuzer – 2012-12-26T14:28:17.853

No, but tools like Acronis offer a way to verify files (although at a cost). – Dave – 2012-12-26T14:29:48.363

@DaveRook: What language is your tool in? I would like to work on it if it's in C/C++/Java/Perl/Python – PoorLuzer – 2012-12-26T17:36:15.460

No you wouldn't, I would not give it out :D C# – Dave – 2012-12-26T19:07:00.383

Answers

2

Your assumption is not entirely correct: large files are never held in memory in full. To speed up copying, files are transferred in fixed-size chunks (on Linux you can tune the chunk size to speed up file operations), and yes, the data is cached in memory along the way. As for DMA (Direct Memory Access), the whole point of the technology is to move data from the disk into RAM without involving the CPU; the data still passes through RAM and does not go directly from HDD to HDD.
I would suggest booting a Linux LiveCD and using rsync or a few very simple scripts, but I understand that this would probably cost more time than it saves, so it is better to stick with Windows. Try these:
http://technet.microsoft.com/en-us/magazine/2006.11.utilityspotlight.aspx
http://www.karenware.com/powertools/ptreplicator.asp
http://sourceforge.net/projects/rsyncwin32/
http://codesector.com/teracopy

EDIT
There is a newer, more powerful edition of Microsoft's Robocopy: http://technet.microsoft.com/en-us/magazine/2009.04.utilityspotlight.aspx

EDIT 2
If you find during replication that something was corrupted, I would doubt that HDD2 is safe for long-term data storage (more sectors are only going to become corrupted).
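As a rough illustration of how such corruption could be detected, the sketch below re-reads a copied file and compares its digest with one recorded earlier. The helper names `file_digest` and `verify_copy`, the chunk size and the use of SHA-256 are assumptions for this example and are not taken from any of the tools listed above.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(path, expected_hex, algorithm="sha256"):
    """Return True if the file at path hashes to the previously recorded digest."""
    return file_digest(path, algorithm) == expected_hex.lower()
```

Note that a verification run immediately after the copy may be served from the operating system's file cache rather than from the disk itself, so it is more meaningful once the cache no longer holds the data (for example after remounting the drive).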

Ernestas

Posted 2012-12-26T14:19:27.973

Reputation: 605

I never mentioned a need to read the whole file into memory. I am aware that most tools read blocks of it at a time. Also, any checksum algo will work on blocks of a file as well and the tool can keep updating this checksum until the file is copied. My question is: is there a tool that copies a file from A to B while, simultaneously generating the checksum of the file it has copied/read? None of the tools you list do that, but I upvoted your answer for your effort. – PoorLuzer – 2012-12-26T17:32:16.130

I know what DMA is. Does the Windows 7 API employ DMA? Is there a tool that does not use the said API but reads a block of data from disc and dumps it out somewhere else, thus not employing DMA? – PoorLuzer – 2012-12-26T17:38:27.627

RichCopy supports a "Verify" option. rsync definitely calculates checksums, but it will be harder to set up as it is a command-line utility. EDIT: I found a great GTK port of the grsync client: http://sourceforge.net/projects/grsync-win/ Try it. It definitely supports checksums and so far it is the best algorithm for making safe copies. No, there is not, because neither hard discs nor motherboard chipsets are able to manage data flows without memory/CPU usage. But you want to calculate a checksum, so the file will in any case have to go through both RAM and CPU. – Ernestas – 2012-12-26T18:37:58.760

However, I still don't really understand why you want to "verify" copying files from one HDD to another. It is a rather rare practice, as HDD-to-HDD traffic is not as "volatile" as, e.g., a network. – Ernestas – 2012-12-26T18:42:52.180