
There seems to be something wrong with my hard drive, but I'm not sure what, or how to proceed. The first sign of any problems was this:

I tried making a new directory on my server, but when I did so, it hung for like 30 seconds, then gave this error:

root@smallgames:~# mkdir derpherp
mkdir: cannot create directory `derpherp': Input/output error
Message from syslogd@smallgames at May  1 18:09:17 ...
 kernel:[8731601.569393] journal commit I/O error

I then tried running fsck:

root@smallgames:~# fsck
fsck from util-linux 2.20.1
e2fsck 1.41.12 (17-May-2010)
/dev/vda1: recovering journal
fsck.ext3: Bad magic number in super-block while trying to re-open /dev/vda1
e2fsck: io manager magic bad!

Running it again gives this:

root@smallgames:~# fsck
fsck from util-linux 2.20.1
fsck.ext3: Unable to resolve 'UUID=e4565c70-2bcd-40c8-ac8a-dab5bab4167c'

Running ls on anything now gives me an empty directory. This is Debian running in a VM in Proxmox.

Running dmesg on the main server (the Proxmox host) gives a lot of these:

ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/08:00:b9:c1:34/00:00:3f:00:00/40 tag 0 ncq 4096 in
         res 41/40:08:c0:c1:34/00:00:3f:00:00/00 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
ata1: EH complete

Output of mdadm --detail /dev/md*:

root@ks212866:~# mdadm --detail /dev/md*
mdadm: /dev/md does not appear to be an md device
/dev/md1:
        Version : 0.90
  Creation Time : Sat Nov  3 22:07:42 2012
     Raid Level : raid1
     Array Size : 10485696 (10.00 GiB 10.74 GB)
  Used Dev Size : 10485696 (10.00 GiB 10.74 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed May  1 21:42:44 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : da7935e9:ed88ed4b:a4d2adc2:26fd5302
         Events : 0.67258

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

       2       8       17        -      faulty spare   /dev/sdb1
/dev/md2:
        Version : 0.90
  Creation Time : Sat Nov  3 22:07:43 2012
     Raid Level : raid1
     Array Size : 965746624 (921.01 GiB 988.92 GB)
  Used Dev Size : 965746624 (921.01 GiB 988.92 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Wed May  1 21:42:59 2013
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 70302f6a:598cdf5f:a4d2adc2:26fd5302
         Events : 0.351218

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       0        0        1      removed

       2       8       18        -      faulty spare   /dev/sdb2

1 Answer


Congratulations. You have encountered an uncorrectable read error on your first drive while your second drive had already failed.
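For anyone who wants to catch this state before the second drive goes, here is a minimal sketch that scans the text printed by `mdadm --detail` and lists arrays whose State line contains "degraded", along with any members marked faulty. The function name `degraded_arrays` and the sample text are made up for illustration, and the parsing assumes the same layout as the output posted in the question; other mdadm versions may format it differently.

```python
import re

def degraded_arrays(detail_text):
    """Map each degraded array to the member devices listed as faulty.

    Assumes the layout shown in the question's `mdadm --detail` output:
    a '/dev/mdN:' header line, a 'State : ...' line, and device rows
    ending in the device path.
    """
    result = {}
    current = None
    degraded = False
    for line in detail_text.splitlines():
        header = re.match(r"^(/dev/md\d+):\s*$", line)
        if header:
            current = header.group(1)
            degraded = False
            continue
        if current is None:
            continue
        state = re.match(r"^\s*State\s*:\s*(.+)$", line)
        if state:
            degraded = "degraded" in state.group(1)
            if degraded:
                result[current] = []
            continue
        if degraded and "faulty" in line:
            # The device path is the last column of the row.
            result[current].append(line.split()[-1])
    return result

# Hypothetical excerpt in the same shape as the question's output.
SAMPLE = """/dev/md1:
          State : clean, degraded
       0       8        1        0      active sync   /dev/sda1
       2       8       17        -      faulty spare   /dev/sdb1
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

Run against the output in the question, this would flag both md1 and md2, each with its sdb partition marked faulty, which is the "second drive had already failed" part of the diagnosis.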

I recommend replacing both drives. Start by replacing the second drive, wait for the rebuild to complete (if it doesn't fail, taking your entire data set with it, which is a real possibility), then replace the first drive. Then take a backup of everything. Finally, run fsck on your host, then within your guest.

However, you will likely not be able to get your data back. With drives that large, the chances of encountering an unrecoverable read error during the resync start at likely and get worse from there.
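To put a rough number on the resync risk, here is a back-of-the-envelope sketch. It assumes a rate of one unrecoverable read error per 1e14 bits read, a figure often quoted on consumer drive datasheets; that rate is an assumption, not something taken from these particular drives. The byte count comes from the md2 array size in the question. Treat the result as a floor for a healthy disk: sda is already returning UNC errors, so the real odds here are worse than the datasheet figure suggests.

```python
import math

def ure_probability(bytes_read, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error.

    Treats every bit read as an independent trial with failure
    probability `ure_per_bit` (assumed datasheet-style rate).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed with log1p/expm1 to keep precision for tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))

if __name__ == "__main__":
    # md2 from the question: 965746624 KiB.
    md2_bytes = 965746624 * 1024
    print(f"{ure_probability(md2_bytes):.1%}")
```

The rebuild has to read every sector of the surviving disk, which is why the risk grows with drive size.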

longneck
  • It is in RAID. I have 2x 1TB hard drives in RAID. I actually just remembered this and came to this page to ask how it's possible for my data to have got corrupted if only one of the drives has died. I don't have backups *because* it's in RAID. :/ – Leagsaidh Gordon May 01 '13 at 18:48
  • What RAID level? 1 or 0? – longneck May 01 '13 at 18:52
  • Also, RAID is not backup. Correctly configured RAID can protect you from many types of hardware failure. It does not protect you from software failure or user error. – longneck May 01 '13 at 18:53
  • RAID 1. Also, this was a hardware failure. – Leagsaidh Gordon May 01 '13 at 19:09
  • Please edit your question to include the output of `sudo mdadm --detail /dev/md*` – longneck May 01 '13 at 19:24
  • see my updated answer. – longneck May 01 '13 at 21:54
  • Assuming, of course, that the first rebuild _ever_ completes, which it may well not. It would be faster and more sure to replace both drives and restore your backups. – Michael Hampton May 01 '13 at 22:02
  • @michaelhampton good point, the possibility of getting another read error is pretty high in this situation. – longneck May 01 '13 at 22:07
  • @SeanGordon The first disk failing was hardware failure. Not replacing it before the other disk failed was user error. Congratulations, you've lost all your data. Here's hoping it will be the last time it ever happens to you: Take backups! – Michael Hampton May 01 '13 at 22:11