How to get a SMART offline test to complete on an active drive?

2

I would like to run a SMART offline test on one of my hard disks (internal SATA). The machine is running Ubuntu 14.04, so I simply do smartctl -t offline /dev/sdb (as root). It starts the test and gives me an estimated time of completion.

The drive in question holds the system's root fs, so it's being actively (but not heavily) used. So when I later run smartctl -a /dev/sdb, even well after the estimated time, I see "Offline data collection status: (0x04) Offline data collection activity was suspended by an interrupting command from host." It's not clear if the test is ever going to finish.

My understanding is that the offline test essentially checks every sector on the disk to see if it can be read. When the computer accesses the disk, the test is suspended and resumed after the command finishes. But it seems that there are enough commands being sent that very little time is spent on the test, so it progresses extremely slowly or not at all. (I also wonder if there is an intentional delay between the completion of the command and the resumption of the test, to avoid switching back and forth too frequently.)

Is there any way to somehow prioritize the SMART test higher, so that it makes progress at a reasonable rate, while still keeping the disk accessible? It would be fine if disk access is slower, as long as the system can still run. I know about captive mode -C, but this would make the system unusable while the test runs. I could of course boot from another disk and run the SMART test with sdb unmounted, but that also would make the system effectively unusable for the duration (and it requires physical access to the machine, which happens to be inconvenient).

I saw "SMART-Test never finishes", but that seems to be the opposite problem: the disk has no activity and enters standby mode. I see no evidence of that being the case here, as the disk is active. I also saw "SMART short offline test never ends for all drives of a RAID1" on ServerFault, but the answer suggests controller or cabling problems, which I have no reason to suspect as the drive is generally working fine.

(As an aside: is there any way to check the progress of an offline test? That would give me some idea as to whether the test has any chance of completing in a reasonable amount of time.)

Nate Eldredge

Posted 2015-08-08T22:17:19.227

Reputation: 181

Answers

1

If the command

sudo smartctl -c /dev/sdb

says

Offline data collection capabilities:
  Suspend Offline collection upon new command.

then your test should complete OK. Perhaps you have smartd running in the background, issuing occasional commands?
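Both checks can be scripted. The following is only a sketch: the function names are made up for illustration, the string matching assumes the "Suspend/Abort Offline collection upon new command" wording that smartctl prints (possibly wrapped across two lines), and `smartd_running` assumes `pgrep` is installed.

```shell
# Read `smartctl -c` output on stdin and print one of: suspend, abort, unknown.
# Matches only the first part of the phrase because smartctl may wrap the
# word "command." onto the next line.
offline_interrupt_mode() {
    out=$(cat)
    case $out in
        *"Suspend Offline collection upon new"*) echo suspend ;;
        *"Abort Offline collection upon new"*)   echo abort ;;
        *)                                       echo unknown ;;
    esac
}

# Succeeds if a process named exactly "smartd" is running (requires pgrep).
smartd_running() {
    pgrep -x smartd >/dev/null 2>&1
}

# Example (as root):
#   smartctl -c /dev/sdb | offline_interrupt_mode
#   smartd_running && echo "smartd is active"
```

If the first check prints `abort`, the drive would restart rather than resume offline collection after each host command, which would be a different situation from the one described above.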

You could instead test the disk in smaller ranges of blocks, using selective self-tests in place of -t offline. For example,

sudo smartctl -t select,0+10000 /dev/sdb

then, when it is done, repeat the following command, waiting for each span to finish before starting the next:

sudo smartctl -t select,next+10000 /dev/sdb
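The repetition can be automated. Below is a rough sketch, not a tested tool: `DEV`, `SPAN`, `POLL` and the function names are illustrative, and the wait loop assumes that `smartctl -c` prints "Self-test routine in progress" while a self-test is running. Only the low three bits of smartctl's exit status (command line error, device open failure, failed command) are treated as fatal, since the higher bits report disk health rather than command failure.

```shell
DEV=${DEV:-/dev/sdb}    # device to test (illustrative default)
SPAN=${SPAN:-10000}     # LBAs per span, as in the commands above
POLL=${POLL:-60}        # seconds between status checks

# Succeeds while the drive reports a self-test in progress.
test_running() {
    smartctl -c "$1" | grep -q 'Self-test routine in progress'
}

# Block until no self-test is reported as running on device $1.
wait_for_test() {
    while test_running "$1"; do
        sleep "$POLL"
    done
}

# Start one selective span on device $1; $2 is a start LBA or the word "next".
start_span() {
    smartctl -t "select,$2+$SPAN" "$1" >/dev/null
    rc=$?
    [ $((rc & 7)) -eq 0 ]
}

# Run $2 consecutive spans on device $1, waiting for each to finish.
run_spans() {
    dev=$1
    count=$2
    start=0
    i=0
    while [ "$i" -lt "$count" ]; do
        start_span "$dev" "$start" || return 1
        wait_for_test "$dev"
        start=next
        i=$((i + 1))
    done
}

# Example (as root): run_spans "$DEV" 5
```

Note that 10000 LBAs is only about 5 MB if the drive uses 512-byte sectors, so a much larger `SPAN` is probably needed to cover a whole disk in a practical number of steps. The result of each span should show up in `smartctl -l selective`.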

meuh

Posted 2015-08-08T22:17:19.227

Reputation: 4 273

I don't doubt that the test will eventually complete ok; the question is when - will it be before I die of old age? The estimated time for offline data collection to complete was 152 minutes, and it has now been 48 hours since it began, with no apparent way to gauge what progress has been made. smartd isn't running. – Nate Eldredge – 2015-08-09T23:03:32.020

@NateEldredge The progress of offline testing is actually shown by smartctl -c under "Total time to complete Offline data collection:", which is supposed to change each time to reflect the current time to wait until done. – meuh – 2015-08-10T09:34:25.100

Have you tested this? For me, "Total time to complete" always says 1920 seconds. And other disks I've tried always report the same "Total time to complete" while a test is running, even though it does eventually finish. It seems to me this would be the natural interpretation of the word "total". – Nate Eldredge – 2015-08-10T18:56:03.820

@NateEldredge I tested it with my SSD and it decreased gradually, but there may be a maximum value the field can hold. Mine wasn't above 1920. – meuh – 2015-08-10T19:00:48.940

Interesting. On another machine with an SSD the "Total time to complete" is always 10 and does not reduce, but it does finish in about 10 seconds. The machine that reports 1920 is a regular hard disk. – Nate Eldredge – 2015-08-10T19:20:40.353
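Since the behaviour of that field seems to differ between drives, one way to settle it for a given disk is to log the status byte and the "Total time to complete" value periodically and see whether either changes. This is only a sketch: the function names are invented, and the patterns assume the layout smartctl usually prints, with the status code in parentheses after "Offline data collection status:" and the seconds value on a line starting with "data collection:".

```shell
# Read `smartctl -c` output on stdin; print the offline status byte, e.g. 0x04.
offline_status() {
    sed -n 's/^Offline data collection status:[^(]*(\(0x[0-9a-fA-F]*\)).*/\1/p'
}

# Read `smartctl -c` output on stdin; print the "Total time to complete
# Offline data collection" value in seconds.
offline_total_seconds() {
    sed -n 's/^data collection:[^(]*( *\([0-9][0-9]*\)) seconds.*/\1/p'
}

# Print one timestamped sample for device $1.
log_sample() {
    out=$(smartctl -c "$1")
    printf '%s status=%s total=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "$(printf '%s\n' "$out" | offline_status)" \
        "$(printf '%s\n' "$out" | offline_total_seconds)"
}

# Example (as root), one sample every 10 minutes:
#   while :; do log_sample /dev/sdb >> offline.log; sleep 600; done
```

Keep in mind that each `smartctl -c` call is itself a command from the host, so polling too often could interrupt the very collection being measured.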