
I've just booted to find my software RAID 5 in Ubuntu not mounting. When I tried to mount it, it gave me an NFS error (which was confusing). I ran fsck on /dev/md0 and my screen scrolled with fixes for about an hour. It claimed to be complete; however, when I mounted it the folder structure was empty. It just has a lost+found folder containing hundreds of files like those in the screenshot below:

[Screenshot: Very confusing files!]

SimonJGreen
  • This is why RAID is not backup. – David Schwartz May 06 '12 at 23:11
  • I didn't say it was backup; I'm asking what could have caused this and for ideas to repair it. Failing that, ways to make sure it won't happen again. – SimonJGreen May 07 '12 at 07:47
  • RAID doesn't protect against filesystem corruption; it only replicates it very quickly across drives. If your memory was failing, if an application or bug reared its head, if a drive started going nutters and the error was replicated across the other disks, etc., then you're going to have issues. RAID only helps protect against a straightforward physical drive dying. It won't save you from filesystem-level corruption. – Bart Silverstrim May 07 '12 at 12:15
  • Unless you're well versed in trying to recover data from the block level and navigating inodes, or unless you get very lucky running the "file" command and figuring out what those fragments of files are so you can piece them back together, your filesystem is purdy hosed. You can either hire a company to recover the data or write it off if there's no backup. You might find a filesystem guru who can piece things back together but you're going to pay quite a bit for it. – Bart Silverstrim May 07 '12 at 12:19
  • You may want to run the same machine through memtest86. Nothing corrupts a filesystem or data faster than a bad spot of RAM that is being used as a buffer. Having detected and sent back 2 sticks of RAM (and I'm on a 3rd now) I can say that consumer-grade memory isn't what it's always cracked up to be. Hopefully you have ECC RAM. – Avery Payne Jan 26 '14 at 03:15
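As a rough first pass over lost+found, before running file on every fragment, the Python sketch below tallies the fragments by their leading magic bytes. It is an illustration only: the signature table is a small hypothetical subset of what file knows, the "text?" heuristic is a guess, and the path you pass in (e.g. /mnt/md0/lost+found) is an assumption about where the array is mounted. e2fsck names reconnected entries after their inode numbers (#12345 and so on), so the names tell you nothing; any directories it reconnected may still hold files under their original names, which makes them worth checking first.

```python
import os
import sys
from collections import Counter

# Hypothetical, deliberately small signature table (magic bytes -> label).
# The real `file` command uses a far larger database; this is only a triage aid.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip/office"),
    (b"\x1f\x8b", "gzip"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"ID3", "mp3"),
]


def classify_bytes(head: bytes) -> str:
    """Label a fragment from its first few bytes."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    if not head:
        return "empty"
    # Heuristic guess: mostly printable ASCII (plus tab/LF/CR) is probably text.
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in head)
    return "text?" if printable / len(head) > 0.95 else "unknown"


def classify_dir(path: str) -> Counter:
    """Count entries in one directory (non-recursive) by guessed type."""
    counts = Counter()
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                counts["directory"] += 1  # reconnected dirs may keep real names
            elif entry.is_file(follow_symlinks=False):
                with open(entry.path, "rb") as fh:
                    counts[classify_bytes(fh.read(16))] += 1
    return counts


if __name__ == "__main__":
    # Only scans paths given on the command line; does nothing otherwise.
    for target in sys.argv[1:]:
        for label, n in classify_dir(target).most_common():
            print(f"{n:6d}  {label}")
```

lost+found is normally readable only by root, so run it with sudo, e.g. sudo python3 classify_lf.py /mnt/md0/lost+found (the script name is hypothetical). Anything it labels "unknown" still needs file, or a closer look with a hex viewer.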

1 Answer


Looks like the filesystem was hosed and the fsck didn't fully repair it.

At this point I'd be tempted to check the logs to see if the disks are all physically working (noises? SMART status? errors in the logs regarding resets? etc.) and restore from backup rather than spend more time trying to straighten out the results of the fsck.
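For the "errors in the logs regarding resets" part of that check, a small Python sketch like the following can tally disk-trouble messages per device in a saved kernel log (on Ubuntu, /var/log/kern.log). The three regular expressions are assumptions based on common libata and block-layer wording, not a complete list. Also, libata port names (ata3) do not map one-to-one onto disk names (sdc), so cross-reference them with dmesg or the boot log.

```python
import re
import sys
from collections import defaultdict

# Assumed patterns for messages that often accompany a struggling disk.
# Group 1 captures the device/port name the message refers to.
PATTERNS = {
    "link reset": re.compile(r"\b(ata\d+(?:\.\d+)?): hard resetting link"),
    "exception": re.compile(r"\b(ata\d+(?:\.\d+)?): exception Emask"),
    "I/O error": re.compile(r"I/O error, dev (\w+)"),
}


def scan_log(lines):
    """Return {device: {kind: count}} for every pattern match in `lines`."""
    hits = defaultdict(lambda: defaultdict(int))
    for line in lines:
        for kind, rx in PATTERNS.items():
            m = rx.search(line)
            if m:
                hits[m.group(1)][kind] += 1
    return {dev: dict(kinds) for dev, kinds in hits.items()}


if __name__ == "__main__":
    # Pass one or more log files, e.g. /var/log/kern.log; no args = no output.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for dev, kinds in sorted(scan_log(fh).items()):
                print(dev, kinds)
```

A clean result is not proof the disks are healthy, but repeated resets or I/O errors clustered on one device point to the disk (or its cable or port) as a suspect.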

Bart Silverstrim
  • Cheers for the idea; however, this was a 4TB NAS used for backup. mdadm shows all disks in the array are fine, and all the disks pass SMART. 90% of the data can be recovered from elsewhere, but there is a good 150-200GB that was moved from working machines to here as central storage. – JTotham May 06 '12 at 22:08
  • Little confused... if the NAS was used for backup, that implies there is another copy of the data somewhere to rebuild from? But if this is a central file-sharing server, is there any tape or drive backup that the server is being backed up to? – Bart Silverstrim May 06 '12 at 22:56
  • This was a home NAS used mostly for backup but also as shared storage, so a tape library isn't really practical. There is no second backup of the main data and no backup at all of the shared data; my problem, I know, and one that I was in the process of fixing. Once it's rebuilt/recovered, I am going to set up some online backup space that the shared data can be replicated to. The most annoying thing is that it was fine 2 days ago and then the filesystem just died with no warning. – JTotham May 07 '12 at 06:48
  • SMART reports OK. Can you explain what you mean by "noises"? – SimonJGreen May 07 '12 at 07:46
  • Hard drives that are physically failing but still nearly working often make the controller/OS send resets to the drive to recover from read/write errors. You can hear a series of clicks or thunks from the drive when this happens. – Bart Silverstrim May 07 '12 at 12:13
  • Ah yes I've heard disks do that before. Not in this instance though. Although shouldn't RAID5 protect against single disk failure like that anyway? – SimonJGreen May 07 '12 at 16:46
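  • In theory. If they're large disks, you may have a latent unrecoverable read error (URE) on one of the other disks that only shows up when the array has to read everything, which is why many/most admins avoid RAID 5 now. I was bitten by a URE with a hardware RAID controller. – Bart Silverstrim May 07 '12 at 16:59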
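To put rough numbers on the URE risk, the Python sketch below estimates the chance of at least one unrecoverable read error while reading the surviving disks during a RAID 5 rebuild. It assumes independent bit errors at the datasheet rate (1 in 10^14 bits is a common consumer-drive spec, 1 in 10^15 for many enterprise drives), which is a pessimistic simplification since the datasheet figure is an upper bound. The disk sizes in the example are hypothetical. A hypothetical four-disk array of 1 TB drives has to read the three surviving disks in full (3 TB, or 2.4 × 10^13 bits), which works out to roughly a 21% chance of at least one URE at the 10^-14 rate.

```python
import math


def p_ure_during_rebuild(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Chance of at least one URE while reading `bytes_read` bytes.

    Assumes every bit fails independently with probability `ure_per_bit`
    (a simplification; the datasheet figure is an upper-bound spec).
    """
    bits = bytes_read * 8
    # P(no URE) = (1 - rate) ** bits; log1p keeps this accurate for tiny rates.
    p_clean = math.exp(bits * math.log1p(-ure_per_bit))
    return 1.0 - p_clean


if __name__ == "__main__":
    # Hypothetical rebuild: three surviving 1 TB disks read in full.
    surviving_bytes = 3 * 1e12
    for rate in (1e-14, 1e-15):
        print(f"rate {rate:g}/bit: {p_ure_during_rebuild(surviving_bytes, rate):.1%}")
```

The estimate grows with the amount of data read, which is why the concern is tied to large disks: doubling the surviving capacity roughly doubles the exponent in the no-error probability.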