How can I tell if a disk is failing on ESXi / what do these errors mean?

Question

I have a server running VMware ESXi v4.1.0 348481. It has a hardware RAID10 and a SATA backup drive. I have a VM running which has its primary boot vmdk on the RAID10 datastore, and a 600 GB vmdk on the SATA backup drive's datastore. The VM runs Debian Linux with the FreeBSD kernel, and uses ZFS for the backup drive.

EDIT: The drive is not directly attached to the VM. It is used as a VMware datastore, and the VM has a vmdk on the SATA drive's datastore. The datastore is not full (only 65% full).

I logged in to the server using SSH and found that last night's backup had hung, and that zfs list and zpool list both hung as well. So I opened the virtual console in ESXi and was sad to see:

This screenshot makes me sad

(da1:mpt0:0:1:0): READ(10). CDB: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): CAM status: SCSI Status Error
(da1:mpt0:0:1:0): SCSI status: Check Condition
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4 (Unrecovered read error - auto reallocate failed)
(da1:mpt0:0:1:0): READ(10). CDB: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): CAM status: SCSI Status Error
(da1:mpt0:0:1:0): SCSI status: Check Condition
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4 (Unrecovered read error - auto reallocate failed)

I tried to reboot the VM and received a message that the system was going down for reboot, and then that hung too. (^C appears but does not kill shutdown.) I cannot interrupt or kill -9 the zpool list, zfs list, or rsync processes -- nothing happens when I try.

Does this indicate the backup SATA drive is failing? Or could this just be an ESXi error?
How in the vSphere client could I tell if the drive is failing? I didn't see any indication, everything under Hardware Health Status looks good, and I saw nothing under the Storage config.
How should I proceed from here? Should I just hard reboot the VM?

UPDATE: I just hard rebooted the VM. After it came back online, the backup zpool was online, however:

root@timestandstill:/home/jnet# zpool status -v
  pool: backup
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        backup      ONLINE       0     0     0
          da1       ONLINE       0     0     0


errors: Permanent errors have been detected in the following files:

        /backups/someserver/home/someuser/public_html/somedir/calendar/someuser/calendars/somefile.ics

I am leaning heavily towards replacing the drive...

score 5 · Accepted Answer · answered Aug 26 '11 at 20:05

Those errors do indeed indicate one of the SATA drives has encountered enough bad blocks to run out of block-reallocation space. That drive is bad, and needs to be replaced. I'm not sure where in the vSphere client that information is displayed, but the log entries are pretty clear.
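To see which block the failing request was for, the READ(10) command bytes in the log can be decoded by hand. The sketch below assumes the standard 10-byte READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8); the function name `decode_read10` is only illustrative. For the bytes shown in the question it gives LBA 0x19973a50 and a length of 45 blocks. Since the disk is file-backed, that LBA is an address on the virtual disk as the guest sees it, not a physical sector on the SATA drive.

```python
import struct


def decode_read10(cdb_text):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Assumes the standard 10-byte layout:
    opcode, flags, 4-byte LBA, group number, 2-byte transfer length, control.
    """
    cdb = bytes(int(tok, 16) for tok in cdb_text.split())
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # ">" selects big-endian with no padding, so the format is exactly 10 bytes.
    _opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    return {"lba": lba, "blocks": blocks}


if __name__ == "__main__":
    # Command bytes copied from the console output in the question.
    result = decode_read10("28 0 19 97 3a 50 0 0 2d 0")
    print("LBA: 0x%08x, blocks: %d" % (result["lba"], result["blocks"]))
```

The same read failing repeatedly at the same address, together with the MEDIUM ERROR sense key, is what points at the media rather than at a transient transport problem.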

If your hardware allows it, a hot-swap should be doable. Otherwise, you're going to have to shut everything down to perform the change-out. If that VM hasn't shut itself down after 30 minutes, it's time to just hard-terminate it. It's risky, but if it is truly hung there isn't much else for it.

sysadmin1138

Thanks @sysadmin1138. So, even though those errors show inside the VM, they clearly indicate a true hardware problem? I.e. it's not just some VMware quirk? – Josh Aug 26 '11 at 20:08
@Josh If you've direct-presented the SATA drive to the VM, then you definitely would see that. For file-backed storage, I'm less certain; it could be that your datastore has run out of space. – sysadmin1138 Aug 26 '11 at 20:18
I should have clarified that. It is file-backed storage. The datastore has plenty of space: it's a 1TB drive and the VMDK is 600 GB, nothing else is on the disk. Either way, something is clearly wrong, so unless someone else gives me a clear explanation of what this could be, I will replace the drive. – Josh Aug 26 '11 at 20:20