How to diagnose disk errors when the disk appears to be OK?

I have a six-month-old 1TB Seagate drive formatted into 2 NTFS partitions, and the disk appeared to be failing with Windows dropping down from UDMA to PIO mode, reporting Delayed Write Errors, and hanging Explorer when browsing directories. My initial suspicion was that the disk was dying.

However, on further examination it appears that Ubuntu, which doesn't write to the volume frequently like Windows does, was able to read the disk properly and retrieve all the data intact, saving me from having to use an older backup. Finally, running the Seatools DOS diagnostic reported that the disk has no problems, i.e. no SMART errors and no bad sectors, apparently.

This, in combination with the relative youth of the disk, suggests that something else is broken. The cable? The PSU? The integrated disk controller? But what would be a good way to diagnose the problem without risking damaging the data? I intend to extract the disk and try it in an external eSATA enclosure and see if the write errors cease, but in the event of the disk appearing to be fine, I would like to be able to confirm what part of the hardware is actually broken here in order to know just what needs replacing.

Are there any good ways to go about this?

Kylotan

Posted 2011-11-12T01:51:32.920

Reputation: 400

There's not a lot you can do other than replace the disk and/or HDD cable. If it's a fault with a component of the disk, it's not worth the cost of repair; return the disk under warranty. If it's a fault with your motherboard you're stuffed. Having said that, I've had massive problems with the Seagate disks I've deployed (having to replace almost all of them over the last 2 years) so I've switched to Western Digital. – Dom – 2011-11-12T05:56:09.687

I know I need to replace something, but what? That's the question.

Incidentally, a Western Digital was the only drive I ever had die within a couple of weeks of purchase so I won't touch them again. In 10 years this is the first Seagate I've had issues with - and it might actually be in perfect condition. – Kylotan – 2011-11-12T13:33:56.527

Answers

Get a copy of HD Tune, SpeedFan, or Hard Disk Sentinel (my preference) and evaluate the SMART data that's been stored on the drive. Look in particular at columns like Ultra ATA CRC Error Count and Reallocated Sector Count. Compare to a known good system.

FYI: SMART errors vary, and certain manufacturers use seemingly random data for particular SMART values, which can make it appear that a drive has rampant errors when in fact all drives of that make/model will have high SMART values (I've seen this on Raw Read Error Rate). So be careful not to jump immediately to conclusions. But if the Ultra ATA CRC Error Count is more than 2, and the Reallocated Sector Count is more than 2, I would feel pretty confident saying something is going wrong. Reallocated Sector Count suggests the drive; Ultra ATA CRC Error Count suggests the cable or controller.
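As a rough illustration of that rule of thumb, here is a minimal Python sketch. It assumes you have copied the raw counter values out of your SMART tool by hand. The function name `suspects`, the attribute groupings, and the default limit of 2 are only illustrative (the limit comes from the rule above) and are not part of any SMART tool or standard. Since some vendors report raw values that are not plain counts, treat the result as a hint, not a diagnosis.

```python
# Hypothetical helper: the names and the limit are illustrative assumptions,
# not taken from any SMART utility.
DRIVE_ATTRS = {"Reallocated Sector Count"}   # points at the drive itself
LINK_ATTRS = {"Ultra ATA CRC Error Count"}   # points at the cable or controller


def suspects(raw_values, limit=2):
    """Return the set of suspected components for the given raw SMART counters."""
    found = set()
    for name, raw in raw_values.items():
        if raw <= limit:
            continue  # at or below the rule-of-thumb limit: not suspicious
        if name in DRIVE_ATTRS:
            found.add("drive")
        elif name in LINK_ATTRS:
            found.add("cable or controller")
    return found


# Made-up example values, not readings from the disk in the question.
example = {"Reallocated Sector Count": 0, "Ultra ATA CRC Error Count": 37}
print(sorted(suspects(example)))  # ['cable or controller']
```

If both counters are over the limit, both components are returned, which matches the "something is going wrong" case above without saying which part to replace first.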

Syclone0044

Posted 2011-11-12T01:51:32.920

Reputation: 1 222

Where would I find a known good system? I don't have another disk of exactly the same make and the values themselves appear almost meaningless. The suspect disk has Ultra ATA CRC Error Count of 200, but says the 'worst' value for it is 171. And another disk of a different brand has 200/200 for those values, but has never exhibited any problems. So I don't think your "more than 2" rule applies here? – Kylotan – 2011-11-12T18:22:18.630

Also, thanks for being the only person so far to have mentioned anything at all that distinguishes between disk errors and controller/cable errors, which is mostly what I was asking about! – Kylotan – 2011-11-12T18:23:29.957

Sorry to add a 3rd comment, but I just found this on the Seagate site, basically stating that their SMART values don't necessarily correspond to any well-known values and that you need to use their tool to interpret them (http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=203971). However, their Seatools program said that the disk is fine.

– Kylotan – 2011-11-12T18:35:13.077

Can you post your SMART data like this guy did?

– Syclone0044 – 2011-11-13T04:32:49.483

I don't think there's much point given that people on that other thread just used Wikipedia to scare him about the 200 values which are actually just the factory defaults. Most of the values are exact 100s or 200s, as I remember. (The disk isn't attached currently or I'd be more certain.) – Kylotan – 2011-11-13T13:29:21.720

Before you do anything else, back up all the data on the disk. Then you can do whatever you want without risking the data. You should, of course, have a backup anyway, because a disk can fail with no warning whatsoever. But when you have a warning, there's no excuse for not immediately backing up everything you care about.

David Schwartz

Posted 2011-11-12T01:51:32.920

Reputation: 58 310

Ok, but that's not really answering my question - how do I find out what is actually broken? If I misdiagnose the problem then I risk damaging future data - and producing corrupted future backups as a result. There's no point me blindly taking copies of this disk on a daily basis if it turns out that something is silently corrupting it. – Kylotan – 2011-11-12T13:32:33.110

"SMART errors" is too vague. There are on the order of 50 SMART attributes that can be monitored, and only some of them may indicate a bad sector.

One advantage of SMART over one-off diagnostics is that it can surface issues over time that may otherwise be intermittent and elusive.

If you are not monitoring or measuring the attributes, you may be ignoring the information that could answer your question.
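To make the "over time" point concrete, here is a small Python sketch that compares two snapshots of raw SMART counters and reports which ones grew. It assumes the values were noted down by hand from whatever monitoring tool you use; the function name `changed_counters` and the sample numbers are made up for illustration and do not come from any real tool or from the disk in the question.

```python
def changed_counters(earlier, later):
    """Return {attribute: increase} for raw counters that grew between two snapshots."""
    return {
        name: later[name] - earlier[name]
        for name in earlier
        if name in later and later[name] > earlier[name]
    }


# Hypothetical readings taken a few days apart.
monday = {"Reallocated Sector Count": 0, "Ultra ATA CRC Error Count": 12}
friday = {"Reallocated Sector Count": 0, "Ultra ATA CRC Error Count": 40}
print(changed_counters(monday, friday))  # {'Ultra ATA CRC Error Count': 28}
```

A counter that keeps climbing between snapshots is the kind of intermittent issue a one-off test can miss.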

There are several monitoring applications available:

http://www.ntfs.com/disk-monitor.htm

http://www.ariolic.com/activesmart

Greg Askew

Posted 2011-11-12T01:51:32.920

Reputation: 259

Re: vagueness - the Seagate tools stated that every value was 'ok'. Another test I ran (though I can't remember which, sorry - might have been Ubuntu running the disk self test) said all the relevant values were "Always passing". The worst values had never crossed the threshold values in any case, apparently. The Windows 7 disk check has just given it a clean bill of health too so I'm really interested in seeing if something else is broken. Still, thanks for the links! – Kylotan – 2011-11-12T18:03:27.610