Why do damaged hard drives freeze the entire system?

129

Why does a hard drive which is known to have bad blocks (verified in HDTune and HDDScan) freeze my entire system?

It is not the OS drive; it is attached to another SATA port, and I'm trying to copy files from it to another healthy drive.

I have experienced this issue with almost every damaged hard drive and every Windows PC.

I would expect to see freezing only in the program I'm using to copy the files (Windows Explorer, etc.), but instead my entire PC gets jerky, and I cannot browse the web or watch movies while copying files from the damaged drive.

The long story:

I live in a rural area where there are problems with electricity (brownouts, etc.). I use a UPS myself, and my own hard drives are perfectly fine. But my neighbors often ask for help with their PC issues, and I often find that their hard drives are damaged, most probably because of the electricity issues. Of course, after replacing a damaged drive I suggest that my neighbors buy a UPS.

I have always wondered why my PC freezes entirely while retrieving data from damaged drives. Is it a hardware issue? Is it caused by the way the OS reads data? Is it something Windows-specific that I won't experience on *nix?

Anyway, from now on I will use some dedicated software (such as Roadkil's Unstoppable Copier) instead of Windows Explorer, although I'm not sure whether it will behave differently and avoid freezing the entire PC.

It is not a request for help; it is more for educational purposes, so that I know why things work that way.

JustAMartin

Posted 2015-08-09T12:35:46.990

Reputation: 1 349

Using an external USB enclosure should help, since you are no longer tying the faulty disk to your system's SATA controller (also, adding an extra layer of sacrificial hardware between your motherboard and a faulty disk is always a good idea). – Matteo Italia – 2015-08-09T16:06:32.317

It's not specific to SATA; IDE drives did this as well. Also, just because the disk is damaged doesn't mean the controller isn't, especially if an electrical fault damaged the disk. – Chris H – 2015-08-10T09:02:41.257

The accepted answer is awesome, and contains what I was going to say and so much more. Basically you are panicking your SATA controller, which is a super-important system device, which in turn panics Windows. I do wonder if enabling AHCI / "hot-swap" in BIOS would improve the situation though. – Arthur Kay – 2015-08-11T18:51:01.377

Answers

169

This is one of those areas where SATA is suboptimal. The problem is at the storage device interconnect protocol level, and thus not related to what software you are running. Using another file copier or another operating system won't magically make things better, except that it might try to set different timeout values to reduce the impact of the problem (which may or may not be possible depending on the hardware and firmware; see below).

There are a few important points here:

  1. With SATA, if the drive stops responding, this can tie up the whole storage system, not just the one drive that is having problems. It certainly has the potential to tie up the whole controller, and since most consumer systems have only a single disk controller (the one integrated on the motherboard), this means all storage. It's even worse if the drive fails in some non-standard and/or unexpected way, which can certainly happen if the drive is marginal. You may be interested in How can a single disk in a hardware SATA RAID-10 array bring the entire array to a screeching halt? on Server Fault.
  2. Most consumer SATA drives have long default timeout periods (on the order of minutes) and many consumer SATA drives lack configurable error recovery control. So-called "NAS" drives often have configurable ERC, and high-end drives virtually always do; such drives may also have shorter default timeouts (7 seconds being a common value). Long timeout periods are advantageous if the drive holds the only copy of the data, which unfortunately is common on consumer systems; they are a disadvantage in a redundant configuration or where you simply want to get as much as possible off the drive before it deteriorates further. (A sketch of adjusting the ERC timeout from the host follows below the list.)
  3. A drive will keep trying to read a bad sector until it reaches its timeout threshold or until an abort is signalled by the host. Since the SATA bus can be tied up by the wait for the read to finish, it might not be possible for the OS to signal a storage-level command abort, and in extreme cases, drives might not even respond well to a SATA bus reset in such a situation.

Point #1 is one of the main selling points for SAS on servers; SAS has significantly better error handling than SATA. Point #2 is a drive firmware limitation, and #3 becomes a problem really only because of #2.
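As a concrete illustration of point #2: on drives that do support SCT Error Recovery Control, the timeout can be inspected and shortened from the host with smartmontools' smartctl. A minimal sketch, assuming smartmontools is installed and the failing disk is /dev/sdb (both assumptions); note also that many drives forget this setting at power-up, so it has to be reapplied after every power cycle:

```python
# Query and set SCT Error Recovery Control (ERC) through smartctl.
# Assumptions: smartmontools is installed, the failing drive is /dev/sdb,
# and the drive actually supports ERC (many consumer drives do not).
import subprocess

DISK = "/dev/sdb"  # hypothetical device name; substitute your own

# Report the current read/write recovery timeouts (in tenths of a second).
subprocess.run(["smartctl", "-l", "scterc", DISK], check=True)

# Tell the drive to give up on an unreadable sector after 7.0 seconds
# instead of its (often multi-minute) default, so a bad sector can stall
# the bus for at most roughly 7 seconds per attempt.
subprocess.run(["smartctl", "-l", "scterc,70,70", DISK], check=True)
```

Keep in mind the trade-off described above: a short ERC timeout is what you want while rescuing data or in a redundant array, but on a drive holding the only copy of the data it reduces the chance that a marginal sector is ever read successfully.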

So what happens is that the OS issues a "read sectors" command to the disk, and the particular sectors are somehow damaged. The disk therefore goes into retry mode to try to get the data off the platters, attempting the read again and again until it gets good enough data that its own forward error correction (FEC) can compensate for the remaining errors. If you are unlucky, that might be never, but the drive will keep trying for some fairly long period of time before deciding that the read isn't going to succeed.

Because the operating system is waiting for the read, this will at the very least slow down the copying process to a crawl, and depending on the exact OS architecture can cause the OS to become jerky or even freeze for the duration. The disk, at this point, is busy with the original read and won't respond to further read commands until the one that is currently executing ends (successfully or unsuccessfully), and other software generally won't do better than the operating system it is running on.

Hence, anything that triggers a read elsewhere (ideally, only on the damaged drive) is going to have to wait in line until the damaged drive either successfully reads the sector in question or determines that it cannot be read. Because of SATA's less-than-optimal handling of nonresponsive drives, it may not be only the drive you are copying from that has its I/O delayed. This can very easily cause other software to become slow or unresponsive as well, as that software waits for a different I/O request to finish, even if the operating system itself is able to cope.
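This is also what dedicated recovery tools (GNU ddrescue, and presumably the Unstoppable Copier mentioned in the question) try to work around: rather than letting one unreadable sector stall a naive sequential copy, they read in blocks, and when a block errors out they log it, fill the gap, and move on, possibly coming back later. A minimal sketch of that skip-on-error strategy, with hypothetical file names (real tools read the raw device, e.g. /dev/sdb, where unreadable sectors surface as I/O errors; an ordinary image file is used here only to keep the sketch self-contained):

```python
# Skip-on-error copying in miniature, loosely in the style of ddrescue.
# File names and block size are assumptions for illustration only.
import os

BLOCK = 4096  # read granularity; smaller blocks recover more data but are slower

def salvage(src_path, dst_path):
    size = os.path.getsize(src_path)
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        offset = 0
        while offset < size:
            try:
                src.seek(offset)
                chunk = src.read(BLOCK)
                if not chunk:
                    break
                dst.write(chunk)
                offset += len(chunk)
            except OSError:
                # The drive could not deliver this block (e.g. EIO after its
                # internal retries).  Fill the gap with zeros and move on
                # instead of re-asking the drive and blocking the bus again.
                pad = min(BLOCK, size - offset)
                dst.write(b"\x00" * pad)
                offset += pad

salvage("damaged.img", "rescued.img")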

It's also important to note here that disk I/O can happen even though you aren't explicitly accessing any files on disk. The two main causes for this would be load-on-demand executable code and swap. Since swap is sometimes used even when the system is not under memory pressure, and load-on-demand executable code is common on modern systems and with modern executable file formats, unintended disk read activity during normal use is a very real possibility.
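A toy demonstration of that demand-loading behaviour, assuming a non-empty file named big_file.bin (a stand-in for, say, a memory-mapped executable) sits on the disk in question:

```python
# Demand paging in miniature: mapping a file reads nothing by itself;
# touching a page is what triggers the disk I/O.  The file name is an
# assumption; the file must exist and be non-empty.
import mmap

with open("big_file.bin", "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Up to this point the file's contents need not have been read at all.
    first = m[0]              # page fault: the OS now reads the first page
    middle = m[len(m) // 2]   # another fault, another read, possibly blocking
    print(first, middle)
    m.close()
```

If the page behind one of those accesses lives on a drive that is busy retrying a bad sector, the faulting process stalls inside what looks like an ordinary memory access, which is exactly how a sick data disk can make seemingly unrelated programs feel frozen.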

As pointed out in a comment to the question by Matteo Italia, one mitigation strategy is to use a different storage interconnect, which is a complicated way of saying "put the disk in a USB enclosure". By going through the USB mass storage protocol, this isolates the problematic SATA portion from the rest of your system, which means that, in theory, only I/O to that specific disk should be affected by I/O problems on that disk.

As a bit of an aside, this is pretty much why SATA (particularly SATA without drive-level ERC) is often discouraged for RAID (especially RAID levels with redundancy, which among the standard ones means everything except RAID 0): the long timeout periods and poor error handling can easily cause a whole device to be thrown out of the array over a single bad sector, a situation the RAID controller could handle just fine if redundancy exists and it simply knows that this is the problem.

SAS was designed for large storage arrays, and thus with the expectation that there will occasionally be problems on various drives; this led to it being designed to handle a single problematic drive or I/O request gracefully even when the drive itself doesn't. Problematic disks are not very common in consumer systems simply because those tend not to have many disks installed, and the ones that are installed virtually never have redundancy; since SATA aimed to replace PATA/IDE, not SCSI (the latter being the niche SAS aimed for), it is likely that its error handling features and demands (or guarantees) were considered adequate for its intended use case.

a CVn

Posted 2015-08-09T12:35:46.990

Reputation: 26 553

I was under the impression that this problem was supposed to have been solved by switching from ATA to SATA. What did I miss? – kasperd – 2015-08-09T17:22:09.453

@kasperd I'm not sure exactly what you mean. SATA is a single device per channel, whereas PATA/IDE can attach two devices per channel. This limits interference between SATA devices, but if the controller is having problems because a drive is nonresponsive, then that will affect all ports on that controller. Long default timeouts are good in most situations where SATA drives are used, but bad in some. High-quality controllers which cleanly separate each channel might fare better, but those are unlikely to be integrated onto consumer motherboards. (Or you could run SATA drives off a SAS HBA...) – a CVn – 2015-08-09T20:20:42.643

Thanks for actually posting a sensible answer that explains what's going on. This is the kind of question where I usually see vague answers like "because the system is waiting for the drive" or "because it's designed that way". – user541686 – 2015-08-10T02:59:30.097

Also: The reason that tying up storage causes the entire system to freeze (even if you're not explicitly accessing storage)? Two words: virtual memory. – user253751 – 2015-08-10T07:26:52.763

@MichaelKjörling So what I get from what you are saying is this. The SATA protocol was designed to not have this problem. But that doesn't stop cheap controllers from being built with a poor design that suffers anyway. – kasperd – 2015-08-10T07:56:36.217

@kasperd: Pretty much. Although part of it is Windows' "fault" as well, as it can happen just as easily with multiple controllers. IMO this answer is a bit deliberately vague, seeing as enterprise SAS controllers aren't immune to the issue either. It really just boils down to certain blocking I/O requests. Some hard drive operations require operation X to be guaranteed to be finished before operation Y, and if X never finishes, Y can never begin - and anything after Y becomes stuck as well, no matter whether the drive, controller, driver, or OS is at fault. – qasdfdsaq – 2015-08-10T10:44:21.423

I guess we would need async hardware with async drivers and an async OS kernel to be able to show a message like "Please wait, trying to read data from your disk; it seems to be taking longer than expected" instead of freezing the entire system. – JustAMartin – 2015-08-11T08:26:45.113

@JustAMartin Actually, it's almost all async already - any peripheral that supports DMA these days is fully asynchronous; the kernel only schedules the requests and handles the interrupts that signal the request is done. The problem is that sometimes you must wait for the operation to complete - and in the process, it can block something important. As user20574 noted, virtual memory is one of those, but there are plenty of things that need some guarantees. Some parts of the kernel aren't asynchronous, and of course, some drivers/devices simply suck. – Luaan – 2015-08-11T09:04:48.110

@MichaelKjörling "Because the operating system is waiting for the read, this will at the very least slow down the copying process to a crawl, and depending on the exact OS architecture can cause the OS to become jerky or even freeze for the duration." -- Why exactly does the OS become jerky in the case of reading from a secondary (non-system) drive? The problem can't be entirely due to the error handling behavior of the SATA controller. I think this answer could benefit from information about how Windows handles errors in its disk subsystem. – Jordan Rieger – 2015-08-11T20:51:49.857

@JordanRieger I'm not familiar with the intricacies of Windows' HAL, I/O, disk driver and process scheduling subsystems (which are likely the ones that would be involved), and I don't want to add information to the answer that I don't feel confident is correct. Especially so now that it is so highly upvoted; this is my highest voted answer throughout the entire Stack Exchange network at this point. – a CVn – 2015-08-11T21:00:14.510

@MichaelKjörling Fair enough. The answer has lots of good info, but I think it doesn't quite explain the OP's specific scenario. To come at it from a different angle, can you cite any reference to back up your point #1: "With SATA, if the drive stops responding, this can tie up the whole storage system, not just the one drive that is having problems. It certainly has the potential to tie up the whole controller."? This seems like a terrible design. Isn't the OS disk subsystem the more likely culprit? I.e. the controller is asynchronous, but the OS driver sometimes blocks unnecessarily. – Jordan Rieger – 2015-08-11T21:11:51.993

@JordanRieger I was going to say the same thing. If the OS's drivers can't handle being blocked by the hardware, the entire OS will freeze, possibly forever. – phyrfox – 2015-08-12T14:48:49.043

It's not just because of the controller. I see it frequently even when there's no contention for the controller--there's a normal everyday operation that causes the same problem: inserting an optical disc. If the drive is locked by burning software there's no problem; otherwise the whole system freezes until it spins up. My drive is on a separate controller from the OS. The worst result should be freezing anything accessing data on the same controller, but that's not what happens. – Loren Pechtel – 2015-08-13T02:15:37.493

3

As stated above, system freezes caused by a bad hard drive are primarily due to the drive's long attempts to recover unreadable data from bad sectors. One of the selling points of enterprise drives is their very short read timeout for failed sectors. Using an enterprise drive can mitigate the issue to some degree, but will not solve it.

The best answer, moving forward, is to maintain proper backups so that recovery isn't required. Changing recovery software will not make a difference, as this is a firmware timeout issue.

John Pace II

Posted 2015-08-09T12:35:46.990

Reputation: 39

2

Why do damaged hard drives freeze the entire system?

They don't have to (in general). How a disk failure is dealt with really depends on the particular file system.

Consider ZFS, which is designed from the ground up for fault tolerance. Here's a demo video (and one with more explanation) where they place running drives on an anvil, take a swing with a sledgehammer, and drill through another drive, all while ZFS keeps running.
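How much ZFS blocks when a device does stop responding is partly configurable through the pool-level failmode property. A minimal sketch, assuming a pool named tank and the ZFS userland tools installed (both assumptions):

```python
# The failmode pool property controls behaviour on catastrophic device
# failure: "wait" (the default) blocks I/O until the device returns,
# "continue" returns EIO to new writes instead of hanging the pool, and
# "panic" halts the system.  The pool name "tank" is an assumption.
import subprocess

subprocess.run(["zpool", "set", "failmode=continue", "tank"], check=True)
subprocess.run(["zpool", "get", "failmode", "tank"], check=True)
```

As the comments below point out, this helps with outright device failure; a marginal drive that answers slowly is still the harder case.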

Jens

Posted 2015-08-09T12:35:46.990

Reputation: 507

Actually, there are disk failures that ZFS doesn't deal well with. For example, extremely long reads before the I/O request times out, in redundant or non-redundant setups. (You can just as easily set up ZFS in such a way that it has no redundancy.) This can easily lead to drives being thrown out of the array in ZFS, which, if it drops you below the redundancy threshold, can cause the whole array to become unavailable. If set with failmode=wait, this can show similar results. All-out full-disk failure is the easy case for any storage subsystem; it's marginal drives that pose problems. – a CVn – 2015-08-11T21:04:52.900

And before you think otherwise, I actually run ZFS (almost exclusively) myself. It's a great file system, and a wonderful volume manager, if you are careful and know what you are doing. However, it's designed for enterprise class systems (high-end workstations and servers), with administrators paid to know what they are doing. It isn't designed to deal well with some failure modes seen in commodity hardware, including RAM problems and drives that take excessively long to return from an I/O request, and it isn't designed for ease of use for home users or in home user use-cases. – a CVn – 2015-08-11T21:08:00.093

Except in the video, ZFS does not keep running. It starts running again after disconnecting the drive. – Christoffer Hammarström – 2015-08-13T10:37:42.013

-2

I think the problem you are encountering is that a low-level part of the OS tries numerous times to read bad blocks before giving up. This routine is implemented at a low level in case it is needed during booting or other standalone operation, and hence it is difficult to make it re-entrant. The operating system pages continually during normal operation, and it is difficult to prioritize competing requests because the low-level system does not know the priority of the process that owns a paging request.

jrrk

Posted 2015-08-09T12:35:46.990

Reputation: 1

The 'low-level system' does know the priority of a process that is requesting a page; such information is held in page tables, though how priority is handled is system-dependent. This is not the correct answer to the question though - this is a hardware issue, not an OS issue. – Chris Cirefice – 2015-08-10T15:41:32.540

I think the correct answer to the question is to refuse to use a faulty drive. However, this would not satisfy users who understandably want to recover as much data as possible. – jrrk – 2015-08-10T16:13:27.627