
After a nasty server crash I was unable to mount a JFS partition on Linux. The jfs_fsck tool reports:

    Duplicate block references have been detected in Metadata.  CANNOT CONTINUE.
    processing terminated:  <date> <time>  with return code: 10060  exit code: 4.

The 12TB partition holds results of scientific computations that can be reproduced in a matter of a few weeks and are not backed up, though I cannot exclude the possibility of some non-reproducible data lying around due to user negligence.

My plan to recover the partition was as follows (a rough command sketch follows the list):

  1. Replay the journal and mount the partition read-only
  2. Copy the files that can be read to another filesystem
  3. Identify the block with duplicate references using jfs_fsck -v
  4. Identify the inodes corresponding to these blocks with jfs_debugfs
  5. Find the filesystem objects corresponding to the inodes using find -inum
  6. Unlink the objects altogether using jfs_debugfs
  7. Run jfs_fsck again and hope it will complete without an error
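
For concreteness, here is a rough shell sketch of steps (1) to (5). The device name, mount points and inode number are placeholders, and the exact jfs_fsck flags may vary between jfsutils versions:

    # (1) Replay the journal only, then mount read-only
    jfs_fsck --replay_journal_only /dev/sdb1
    mount -t jfs -o ro /dev/sdb1 /mnt/broken

    # (2) Copy out what is still readable; rsync skips unreadable
    # files and keeps going rather than aborting
    rsync -a /mnt/broken/ /mnt/rescue/

    # (3) Report-only verbose check to list the duplicated blocks
    jfs_fsck -n -v /dev/sdb1

    # (4) Inspect the corresponding inodes interactively
    jfs_debugfs /dev/sdb1

    # (5) Map an inode number back to a path (INUM is a placeholder)
    find /mnt/broken -xdev -inum INUM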

The plan worked only through steps (1) to (4). It first failed at step (5): find did not turn up a single inode after running for several hours and might well have run forever. While copying files I had found that some directories had their B+trees turned into graphs with loops, so a directory traversal might genuinely never terminate.
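
One way to bound a scan that may loop is to cap it with coreutils timeout and keep find on a single filesystem; GNU find also prints a "File system loop detected" warning when it notices a cycle. A minimal sketch, assuming the partition is mounted at /mnt/broken:

    # Give the inode search at most two hours; a corrupted tree
    # with loops may otherwise keep it spinning indefinitely
    timeout 2h find /mnt/broken -xdev -inum INUM -print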

I jumped straight to step (6) and first unlinked the directories where I had found corrupted structures, but this did not make jfs_fsck run to completion. I then removed every directory except the root directory entry; jfs_fsck still failed to complete.

I guess I have to edit not only the directory structure but also the block allocation maps. However, I could not find a way to do that with jfs_debugfs.
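
For what it is worth, jfs_debugfs can at least display the allocation structures; the subcommand names below are from the jfs_debugfs man page and a given jfsutils version may differ. There seems to be no high-level "free this block" operation, only raw byte patching via alter:

    jfs_debugfs /dev/sdb1
    > superblock          # sanity-check the superblock first
    > dmap                # display the block allocation map
    > inode 12345         # display a suspect inode (number is a placeholder)
    > quit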

Are there tools that can help make a partition with duplicate block references amenable to recovery?

Dmitri Chubarov
  • *sigh* -- Looks like another casualty of backup problems. What is the cost of backup vs. recreating the data? – mdpc Dec 21 '12 at 18:31

1 Answer


If you can mount the disk read-only at all, you could try to copy out the data that is still readable. If the corruption is confined to the journal, it may be that only the last few file changes are lost, so most files should come out intact.
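
Before any further destructive jfs_fsck runs, it may also be worth imaging the device with GNU ddrescue and experimenting on the copy. The paths below are placeholders, and the image needs as much free space as the partition itself:

    # Image the damaged device; the mapfile lets ddrescue resume
    ddrescue -d /dev/sdb1 /mnt/scratch/jfs.img /mnt/scratch/jfs.map

    # Attach the image to a loop device and mount that read-only
    losetup -f --show /mnt/scratch/jfs.img    # prints e.g. /dev/loop0
    mount -t jfs -o ro /dev/loop0 /mnt/broken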

However, when a file holds data, how would you know whether it is correct or whether it has been corrupted itself?
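
If reference copies of any of the results exist elsewhere, checksums are the cheapest sanity check; a minimal sketch using standard coreutils, with /mnt/rescue and the reference path as placeholders:

    # Record checksums of everything recovered, with relative paths
    cd /mnt/rescue && find . -type f -exec sha256sum {} + > /tmp/recovered.sha256

    # Verify the same relative paths inside a reference tree
    cd /path/to/reference && sha256sum -c /tmp/recovered.sha256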

Of course, a journal corruption could also be hiding a more serious disk problem.

At this point, my thought would be that to ensure the integrity of the data, you'll probably have to rerun the simulations.

mdpc