Diagnosing disk health with smartctl

Question

How do you determine if a disk has problems using smartctl?

I have an Ubuntu 12.04 server using software RAID1, which became completely unresponsive. I rebooted, and it hung at boot with the message "/tmp is not ready or not present", so I skipped and started up a manual recovery terminal. Everything seemed fine, except my RAID resync was horribly slow. However, cat /proc/mdstat didn't show any actual RAID failure.

I bumped up my /proc/sys/dev/raid/speed_limit_min following the instructions here, but that didn't help too much. My 1TB array has been resyncing for 30 minutes now, but it's only 0.3% complete.

So I installed smartmontools and checked the disks using:

sudo smartctl --all /dev/sda
sudo smartctl --all /dev/sdb

Both report a "PASSED" health, but sdb is also showing several lines like:

Error 83 occurred at disk power-on lifetime: 15147 hours
Error 82 occurred at disk power-on lifetime: 15147 hours
Error 81 occurred at disk power-on lifetime: 15147 hours
Error 80 occurred at disk power-on lifetime: 15147 hours

along with some sort of hex-dump for each.

What does this mean? Should I interpret these errors to mean my sdb disk is dying? How do I confirm this?

Edit: Also related, ever since the crash, I'm now unable to SSH into the server. I can access it just fine from a physical terminal, and there doesn't seem to be any excessive load. I made sure the firewall was disabled, and I can still ping the server, but ssh myuser@myserver results in "Connection timed out".

Yes, those errors mean your disk is dying, even if you see a "PASSED" report. Replace the drive (it's eligible for warranty replacement in this case). — Michael Hampton, Mar 21 '14 at 21:01

score 2 · Answer 1 · answered Mar 21 '14 at 20:13

If one of the disks fell out of the RAID, there is likely a reason. I would replace the failed disk (sounds like sdb) and rebuild to that instead. On to the SMART data.

The smartctl -a output has a large section headed "SMART Attributes Data Structure". It is a table that lists, for each attribute, its current normalized value, worst value, threshold, and raw value. Some of the important ones to look out for are:

Raw_Read_Error_Rate (id 1)
Reallocated_Sector_Ct (id 5)
Spin_Retry_Count (id 10)
Reported_Uncorrect (id 187)
Offline_Uncorrectable (id 198)

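These all relate to issues with the surface of the disk (except for id 10, which is related to the spindle motor). The disk surface is the part of the drive most likely to fail. If the raw value of any of these is abnormally high (in the hundreds or thousands), you know for sure there is a big problem. One caveat: some vendors (Seagate, for example) report very large raw Raw_Read_Error_Rate numbers on healthy drives, so judge that one against the drive's normal baseline.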
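To pull just those rows out of the table, something like the following might help. It is only a sketch: key_attrs is a name I made up, and it assumes the usual ten-column attribute layout (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value) that smartctl -A prints for ATA drives.

```shell
# Print ID, name and raw value for the attributes listed above.
# Reads `smartctl -A` output on stdin; key_attrs is a hypothetical helper name.
key_attrs() {
    awk 'NF >= 10 && $1 ~ /^[0-9]+$/ &&
         ($1 == 1 || $1 == 5 || $1 == 10 || $1 == 187 || $1 == 198) {
        # Column 2 is the attribute name, column 10 the raw value.
        printf "%s %s %s\n", $1, $2, $10
    }'
}

# Typical use (needs root and a real device):
#   sudo smartctl -A /dev/sdb | key_attrs
```

Not every drive reports every one of these IDs, so a missing row is not by itself a sign of trouble.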

The registers at the bottom look something like this:

ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

In this case, there was a UNC error on the disk (uncorrectable read/write error).
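If you want to see where the LBA figure comes from, it can be rebuilt from the register bytes. The sketch below assumes the classic 28-bit layout (low nibble of DH, then CH, CL, SN); lba28_from_regs is a made-up helper name, not part of smartmontools.

```shell
# Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes
# (hex digits without a 0x prefix). Hypothetical helper for illustration.
lba28_from_regs() {
    sn=$1 cl=$2 ch=$3 dh=$4
    # The low nibble of DH holds LBA bits 27..24; CH, CL and SN follow.
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Register bytes from the example line above: SN=ff CL=ff CH=ff DH=0f
lba28_from_regs ff ff ff 0f
```

For the example line this prints 268435455, matching the value smartctl shows.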

My opinion is that if you see anything like this:

Error 518 occurred at disk power-on lifetime: 16859 hours

...the disk should be replaced when it is convenient to do so.
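A quick way to check for this without reading the whole report is smartctl's exit status, which is a bit mask documented in the EXIT STATUS section of the smartctl man page. The decoder below is a rough sketch with a made-up function name; it only covers bits 3 to 7, which are the ones that describe disk health.

```shell
# Decode the health-related bits of smartctl's exit status.
# decode_smartctl_status is a hypothetical helper name.
decode_smartctl_status() {
    status=$1
    [ $(( status & 8 ))   -ne 0 ] && echo "bit 3: SMART status reports DISK FAILING"
    [ $(( status & 16 ))  -ne 0 ] && echo "bit 4: prefail attribute at or below threshold"
    [ $(( status & 32 ))  -ne 0 ] && echo "bit 5: attribute was at or below threshold in the past"
    [ $(( status & 64 ))  -ne 0 ] && echo "bit 6: device error log contains records"
    [ $(( status & 128 )) -ne 0 ] && echo "bit 7: self-test log contains errors"
    return 0
}

# Typical use (needs root and a real device):
#   sudo smartctl -a /dev/sdb >/dev/null; decode_smartctl_status $?
```

A drive can report "PASSED" and still set bit 6, which is the situation described in the question.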

The SSH issue may be related to the disk (it could be that the corrupt portion is under the SSH binary), but this is likely something else you should investigate separately.

score 1 · Answer 2 · answered Feb 27 '14 at 06:13

Make sure you're backed up before all else.

Regarding the /tmp error, it's a known bug:

https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/1091792

Re: SMART errors:

Run a long self-test: smartctl -t long /dev/sdb

You can run this at any time. It will take a while. When it's done, view the results with smartctl -l selftest /dev/sdb.
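If you would rather script the check than read the log by eye, a sketch along these lines could work. It assumes the usual ATA self-test log layout, where the newest entry is the row starting with "# 1"; latest_selftest is a name I invented for the example.

```shell
# Summarise the newest entry of `smartctl -l selftest` output read from stdin.
# latest_selftest is a hypothetical helper name.
latest_selftest() {
    awk '$1 == "#" && $2 == "1" {
        found = 1
        if ($0 ~ /Completed without error/) print "passed"
        else if ($0 ~ /in progress/)        print "running"
        else                                 print "failed or aborted"
    }
    END { if (!found) print "no self-test logged" }'
}

# Typical use (needs root and a real device):
#   sudo smartctl -t long /dev/sdb
#   ...wait for the estimated duration that smartctl prints...
#   sudo smartctl -l selftest /dev/sdb | latest_selftest
```

Anything other than "passed" is worth reading in full, since the log row also gives the LBA of the first error.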

And of course make sure you're backed up before all else.

The biggest concern I would have from what you posted is that you seem to have a sudden cluster of errors (at under two years of activity for the drive). A failed drive should not take your system down, however (in fact you should see a lot of other noise in the logs around the time it froze). Occasional errors are pretty normal, but a lot at the same time are cause for concern.

SMART is sometimes useful for early warning, but you certainly can't rely on it alone.
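To see how tightly the logged errors cluster, you could count them per power-on hour. This is a rough sketch with a made-up function name, and it assumes the "Error N occurred at disk power-on lifetime: H hours" line format quoted in the question. Keep in mind that the standard error log only holds the most recent few entries.

```shell
# Count error-log entries per power-on hour from `smartctl -l error` output.
# errors_per_hour is a hypothetical helper name; input arrives on stdin.
errors_per_hour() {
    awk '/^Error [0-9]+ occurred at disk power-on lifetime:/ {
        count[$8]++            # field 8 is the hour stamp
    }
    END { for (h in count) print h, count[h] }' | sort -n
}

# Typical use (needs root and a real device):
#   sudo smartctl -l error /dev/sdb | errors_per_hour
```

For the four lines quoted in the question this prints "15147 4", meaning all four errors landed within the same hour.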

So it wouldn't hurt to back up again. But I don't think you have any reason to panic.

I'd like to clarify your point about early warning: If SMART (error log/attribute/overall health check) gives you evidence to suggest the drive is not healthy, you can be very sure the drive is not healthy and you should replace it. But the absence of these indicators doesn't mean the drive won't have issues in the near future. SMART is a perfectly good indicator of a failing drive; it's just not a good indicator of a healthy one. – Daniel Lawson, Feb 27 '14 at 09:41

score 1 · Answer 3 · answered Feb 27 '14 at 09:38

Many of the attributes in the SMART attribute table are useful indicators of failing drives. Could you update your post with the output of 'smartctl -d ata -A /dev/sdb'? The attribute table is drive-dependent so I can't list the ones that will be relevant, other than the fairly generic ones like 'Reallocated_Sector_Ct', 'Offline_Uncorrectable', etc. The Wikipedia page on SMART contains descriptions of most attributes.

The SMART self-test that quadruplebucky suggested is useful too, but those attribute counters can tell you right away if a drive is failing. The drive might not trigger the overall SMART health warning but still be obviously on the way out.
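As a first pass over the attribute table, you could list any attribute whose normalized value has reached its threshold. This is a minimal sketch with an invented function name, assuming the standard ten-column layout from smartctl -A. It only catches what the drive itself admits to, so the raw counters still need a look.

```shell
# List attributes whose normalized VALUE is at or below THRESH.
# Reads `smartctl -A` output on stdin; at_or_below_threshold is a hypothetical name.
at_or_below_threshold() {
    # Column 4 is VALUE and column 6 is THRESH; a threshold of 0 never trips.
    awk 'NF >= 10 && $1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
        print $1, $2, "value=" $4, "thresh=" $6
    }'
}

# Typical use (needs root and a real device):
#   sudo smartctl -A /dev/sdb | at_or_below_threshold
```

Empty output here does not mean the drive is healthy; it only means no attribute has crossed the vendor's own limit yet.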

score 1 · Answer 4 · answered Dec 23 '16 at 22:14

Regarding your backup: if you wait for a SMART error or warning before backing up, you have waited too long. Best practice includes a tested backup plan, plus sufficient redundancy in the storage subsystem to handle anticipated hardware failures.

ttwalkertt

This doesn't answer the question, it'd be better suited as a comment. – Adam Gibbins Dec 28 '16 at 01:35
