4

I have a tricky dilemma. I have files on two different destination drives, both copied from the same source drive. The source drive had been failing, so I used dd to copy the data to one destination (with the options conv=noerror,sync, which fill errored blocks with zero bytes), and I used ddrescue on the same source drive to copy the data to a second partition. I've heard that ddrescue also fills errors with zero bytes.

Now I have two destination drives with near-duplicate data, except that some of the data definitely differs between them. I can only presume that the differences come from the parts of the files that were zero-filled where errors were encountered during copying, and that the zero-filled spots are in different places on the two drives. So some files are fully intact on one destination while their counterparts on the other are not, and vice versa. Most of the data consists of binary files.

Ideally, I'd like to synchronize both drives as follows:

  • Compare each file, bit-by-bit.
  • If the left file's bit is 1 and the right file's bit is 0, copy the 1 over to the right.
  • If the left file's bit is 0 and the right file's bit is 1, copy that 1 to the left, or at least keep the 1 on the right, if two-way synchronization isn't an option.

This functionality makes sense to me, but is there a utility that can handle this automatically? I thought about using rsync for this, but it seems that rsync only checks the file based on size & timestamp or by checksum, rather than bit-by-bit, and a simple checksum won't tell you where there are 0s when there should be 1s. I also looked into rdiff and bsdiff, both of which support binary files, but both of them seem to just output a diff file, rather than doing any actual copying/synchronizing.

So, is there a utility that implements the syncing behavior described above? The OS shouldn't matter much, as I have access to OS X, Windows and Ubuntu.

purefusion
  • What you want to do is pretty rough -- it's difficult to determine which copy of the file is "correct" in this situation (is that zero supposed to be a zero, or is it an error?). What you're really asking for is a magic wand that will repair data loss, and I'm afraid that's not something that anyone can offer you in software, beyond what ddrescue already tried or what may be available through a commercial data recovery group. – voretaq7 Feb 08 '11 at 16:38
  • Think about it though... If a zero is *supposed* to be there, and instead, a zero is there because of an error, both sides will have a zero, regardless of an error. Thus, it will stay a zero anyway. No erroneous data will have anything other than a zero. So technically only 1s will need to be copied over. – purefusion Feb 08 '11 at 16:43
  • @purefusion the problem is the "supposed to be" part. Software doesn't know "supposed to", it knows "is" and "is not". In the pathological case (two copies of a file are bitwise `NOT`s of each other) your algorithm above will produce a file that's all 1s -- that's almost certainly not what you want... – voretaq7 Feb 08 '11 at 16:46
  • I don't see how that would be the case. If both sides' bits are the same, why would it need to change anything for those bits? It should know that both sides have a zero already, and thus not change anything. – purefusion Feb 08 '11 at 16:50
  • @purefusion -- you've got a very special case where the _only_ errors you expect are zeroed-out blocks. I don't think any existing programs are built to deal with this, since it's such a narrow problem. But even if you do make or find something, what are you going to do about overlapping bad areas from your two sources? – mattdm Feb 08 '11 at 17:11
  • I think it's very likely, given what you've described, that such an overlap is very common in your destinations. In fact, it's most likely that `ddrescue` has gotten everything the `dd` copy did, plus possibly more (and _hopefully more correct_) data which `dd` just gave up on. – mattdm Feb 08 '11 at 17:16
  • So, really, the answer here is "use the ddrescue version". And, if your bad source drive is still functional at all, you can run ddrescue on it again with the same output file as many times as you want until you've extracted the most possible data. – mattdm Feb 08 '11 at 17:18
  • Well, I used dd first and ddrescue second, but ddrescue encountered more errors, and since I was using error skipping, I never bothered to retry the erroneous blocks. But I believe the drive's main issue wasn't actually bad blocks so much as read errors due to a misaligned head, because reading the drive was excruciatingly slow. Thus, I'm led to believe that the errors were encountered at random places on each copy, rather than at the same locations. Nevertheless, if there are overlaps, I'd still like to sync whatever data I can. – purefusion Feb 08 '11 at 17:28
  • If the data is valuable enough to go to all that trouble, why not use ddrescue as intended? (And, in 20/20 hindsight, why isn't it backed up elsewhere?) ddrescue does what you want not by treating zeros as magical but by deciding to overwrite based on whether it got errors. (It'll never write out _new_ bad-read zero blocks.) – mattdm Feb 08 '11 at 17:34
  • @voretaq7: According to Wikipedia http://en.wikipedia.org/wiki/Bitwise_operation, bitwise NOT operations aren't what I'm after. However, it looks like bitwise OR operations would do what I want, though finding or scripting a utility that does this in a comparison fashion is another story altogether. – purefusion Feb 08 '11 at 17:35
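
The disagreement in the comments above can be checked on a few bytes. This is only an illustrative sketch: the byte values and the helper name `bitwise_or` are made up for the example, and it assumes that a failed read always shows up as zero bytes.

```python
def bitwise_or(left: bytes, right: bytes) -> bytes:
    """OR two equal-length byte strings together, byte by byte."""
    if len(left) != len(right):
        raise ValueError("inputs must be the same length")
    return bytes(a | b for a, b in zip(left, right))


original = b"\x12\x34\x56\x78"  # hypothetical data from the source drive

# Each copy lost a different region to zero-fill: OR recovers everything.
copy_dd = b"\x00\x00\x56\x78"
copy_ddrescue = b"\x12\x34\x00\x00"
assert bitwise_or(copy_dd, copy_ddrescue) == original

# Both copies lost the same byte: the overlap stays zero.
overlap_a = b"\x12\x00\x56\x78"
overlap_b = b"\x12\x00\x00\x78"
assert bitwise_or(overlap_a, overlap_b) == b"\x12\x00\x56\x78"

# Pathological case: one copy is the bitwise NOT of the other.
inverted = bytes(b ^ 0xFF for b in original)
assert bitwise_or(original, inverted) == b"\xff\xff\xff\xff"

# Two non-zero bytes that disagree produce a value matching neither copy.
assert bitwise_or(b"\x12", b"\x21") == b"\x33"
```

So OR is safe only as long as every difference between the copies really is a zero-filled region; any other kind of corruption gets silently blended into the result.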

2 Answers

3

It almost sounds like what you want is a tool that will retrieve each block of both files, do a bitwise OR on each pair of blocks, and send the output to a new file.

The pseudo-code might look like the below. Nothing would happen to identical bits, and where the bits are not identical the resulting bit would be set to 1.

while not end-of-files:
  read block_a from file_a
  read block_b from file_b
  merged_block = block_a bitwise_or block_b
  write merged_block to file_c
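
Since Python comes up in the comments on this answer, here is a minimal sketch of that loop in Python. The function name `or_merge`, the 512-byte block size (taken from the dd block size mentioned elsewhere in the thread), and the padding of a shorter file with zeros are all assumptions made for the example, not part of the original answer.

```python
import itertools

BLOCK_SIZE = 512  # assumption: matches the dd block size mentioned in the thread


def or_merge(path_a, path_b, path_out, block_size=BLOCK_SIZE):
    """Write the bitwise OR of two files to path_out, block by block."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb, \
            open(path_out, "wb") as fc:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if not block_a and not block_b:
                break  # both files are exhausted
            # zip_longest pads the shorter block with 0, so if the files
            # differ in length the longer file's tail is copied unchanged.
            merged_block = bytes(
                a | b
                for a, b in itertools.zip_longest(block_a, block_b, fillvalue=0)
            )
            fc.write(merged_block)
```

A call would look like `or_merge("copy_dd.bin", "copy_ddrescue.bin", "merged.bin")`, where the file names are placeholders. Writing to a third file rather than modifying either copy in place keeps both originals available if the merge turns out to be wrong.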
Zoredache
  • I was just browsing http://en.wikipedia.org/wiki/Bitwise_operation and noticed the bitwise OR, which seemed to be what I was after. Of course, finding or scripting a utility that does this in a file-comparative fashion is another story. My programming experience is limited to the Web up until now. What would you recommend if I were to script something like this myself? – purefusion Feb 08 '11 at 17:39
  • Python should be pretty easy for something like this. – Zoredache Feb 08 '11 at 17:43
  • I know a bit of python, maybe I'll give it a go... – purefusion Feb 08 '11 at 17:52
0

Rsync should let you do one-way syncing. I believe it also has a check option to tell you if files differ.

Zoredache
Jason Tan
  • If you had read the question, I already mentioned why rsync would NOT work. It doesn't sync bit-by-bit. It will only replace whole files, as far as I'm aware... – purefusion Feb 08 '11 at 16:40
  • rsync will make file A into a copy of file B, but that doesn't seem to be what purefusion wants here... – voretaq7 Feb 08 '11 at 16:43
  • Rsync does, in fact, use an intelligent binary-diff algorithm, although it works on a block basis rather than bit-by-bit. http://rsync.samba.org/tech_report/node2.html – mattdm Feb 08 '11 at 17:00
  • Ah right, hence the `--block-size` option. Well, since I was using a block size of 512 with dd, I wonder if simply using the same block size with rsync would allow just those blocks to be copied over? After all, when errors are encountered in dd, the whole block would be filled with 0s... ah, but then again, can rsync know which blocks are zero-filled vs not? – purefusion Feb 08 '11 at 17:05
  • Yeah, rsync doesn't care about your special-case. – mattdm Feb 08 '11 at 17:30
  • I think you should change the subject of your question to reflect that you're not looking for a general-purpose tool to synchronize based on just _any_ differences in binary data. – mattdm Feb 08 '11 at 17:31
  • What difference does it make if you do it bit by bit or block by block, as long as the files are synced (in the one-way direction, that is)? I understand that for your preferred two-way comparison, bit by bit is required. – Jason Tan Feb 14 '11 at 12:31
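
To make the block-by-block idea from these comments concrete, here is a rough Python sketch of a block-level merge that treats only all-zero blocks as suspect. It is not something rsync does. The function name `merge_zero_filled`, the 512-byte block size, and the choice to keep the first copy's block on a conflict are assumptions for illustration.

```python
BLOCK_SIZE = 512  # assumption: the dd block size mentioned in the comments


def merge_zero_filled(data_a: bytes, data_b: bytes, block_size: int = BLOCK_SIZE):
    """Merge two equal-length images block by block.

    A block that is entirely zero in one copy is replaced by the other
    copy's block. Returns (merged, conflicts), where conflicts lists the
    offsets of blocks that differ although neither is zero-filled.
    """
    if len(data_a) != len(data_b):
        raise ValueError("inputs must be the same length")
    merged = bytearray()
    conflicts = []
    for offset in range(0, len(data_a), block_size):
        block_a = data_a[offset:offset + block_size]
        block_b = data_b[offset:offset + block_size]
        if block_a == block_b:
            merged += block_a
        elif not any(block_a):      # copy A is zero-filled here
            merged += block_b
        elif not any(block_b):      # copy B is zero-filled here
            merged += block_a
        else:
            conflicts.append(offset)
            merged += block_a       # arbitrary choice; needs manual review
    return bytes(merged), conflicts
```

Unlike a plain bitwise OR, this never blends two disagreeing non-zero blocks together; it reports them instead. It still cannot tell a legitimately all-zero block from a failed read, so blocks that are zero in both copies stay zero.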