107

Prelude:

I'm a code-monkey that's increasingly taken on SysAdmin duties for my small company. My code is our product, and increasingly we provide the same app as SaaS.

About 18 months ago I moved our servers from a premium, hosting-centric vendor to a bare-bones rack pusher in a tier IV data center. (Literally across the street.) This meant doing much more ourselves--things like networking, storage and monitoring.

As part of the big move, to replace our leased direct-attached storage from the hosting company, I built a 9TB two-node NAS based on SuperMicro chassis, 3ware RAID cards, Ubuntu 10.04, two dozen SATA disks, DRBD and NFSv4. It's all lovingly documented in three blog posts: Building up & testing a new 9TB SATA RAID10 NFSv4 NAS: Part I, Part II and Part III.

We also set up a Cacti monitoring system. Recently we've been adding more and more data points, like SMART values.
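
For the curious, getting SMART values into Cacti does not take much: each data point is just a script that prints a number. Below is a minimal sketch of such a data-input script, assuming smartctl is installed and the drives sit behind the 3ware controller at /dev/twa0 (the script name, port argument and attribute name are illustrative, not our exact setup):

#!/bin/bash
# smart_attr.sh (hypothetical): print the raw value of one SMART attribute
# for one drive behind the 3ware controller, for Cacti to graph.
# Usage: smart_attr.sh <3ware port number> <attribute name>
#   e.g. smart_attr.sh 7 Raw_Read_Error_Rate
PORT="$1"
ATTR="$2"
# -d 3ware,N addresses the Nth drive behind the controller at /dev/twa0;
# -A prints the attribute table, where column 10 is the raw value.
smartctl -A -d 3ware,"$PORT" /dev/twa0 | awk -v a="$ATTR" '$2 == a { print $10 }'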

I could not have done all this without the awesome boffins at ServerFault. It's been a fun and educational experience. My boss is happy (we saved bucket loads of $$$), our customers are happy (storage costs are down), I'm happy (fun, fun, fun).

Until yesterday.

Outage & Recovery:

Some time after lunch we started getting reports of sluggish performance from our application, an on-demand streaming media CMS. About the same time our Cacti monitoring system sent a blizzard of emails. One of the more telling alerts was a graph of iostat await.

[Cacti graph of iostat await]

Performance became so degraded that Pingdom began sending "server down" notifications. The overall load was moderate; there was no traffic spike.

After logging onto the application servers, NFS clients of the NAS, I confirmed that just about everything was experiencing highly intermittent and insanely long IO wait times. And once I hopped onto the primary NAS node itself, the same delays were evident when trying to navigate the problem array's file system.
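
(For reference, those long waits are easy to see from the shell with iostat from the sysstat package; the await column, in milliseconds per request, is the same metric the Cacti graph above tracks.)

# extended per-device statistics every 5 seconds; watch the await column
iostat -x 5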

Time to fail over. That went well; within 20 minutes everything was confirmed to be back up and running perfectly.

Post-Mortem:

After any system failure I perform a post-mortem to determine the cause. The first thing I did was try to ssh back into the box and start reviewing logs, but it was completely offline. Time for a trip to the data center. One hardware reset later, it was back up and running.

In /var/log/syslog I found this scary-looking entry:

Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_00], 6 Currently unreadable (pending) sectors
Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_07], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 171 to 170
Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 16 Currently unreadable (pending) sectors
Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 4 Offline uncorrectable sectors
Nov 15 06:49:45 umbilo smartd[2827]: Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Nov 15 06:49:45 umbilo smartd[2827]: # 1  Short offline       Completed: read failure       90%      6576         3421766910
Nov 15 06:49:45 umbilo smartd[2827]: # 2  Short offline       Completed: read failure       90%      6087         3421766910
Nov 15 06:49:45 umbilo smartd[2827]: # 3  Short offline       Completed: read failure       10%      5901         656821791
Nov 15 06:49:45 umbilo smartd[2827]: # 4  Short offline       Completed: read failure       90%      5818         651637856
Nov 15 06:49:45 umbilo smartd[2827]:
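
For anyone retracing this: smartd only logs what it notices on its schedule, but the same data can be pulled on demand with smartctl through the 3ware controller. Roughly the sort of commands involved (the port numbers here are simply the suspects from the log above):

# full SMART report for the drive on 3ware port 7
smartctl -a -d 3ware,7 /dev/twa0

# kick off a short self-test on port 8, then read the result a few minutes later
smartctl -t short -d 3ware,8 /dev/twa0
smartctl -l selftest -d 3ware,8 /dev/twa0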

So I went to check the Cacti graphs for the disks in the array. Here we see that, yes, disk 7 is slipping away just like syslog says it is. But we also see that disk 8's SMART Read Errors are fluctuating.

[Cacti graph of per-disk SMART values, showing disk 8's read errors fluctuating]

There are no messages about disk 8 in syslog. More interesting is that the fluctuating values for disk 8 directly correlate to the high IO wait times! My interpretation is that:

  • Disk 8 is experiencing an odd hardware fault that results in intermittent long operation times.
  • Somehow this fault condition on the disk is locking up the entire array.

Maybe there is a more accurate or correct description, but the net result has been that the one disk is impacting the performance of the whole array.

The Question(s)

  • How can a single disk in a hardware SATA RAID-10 array bring the entire array to a screeching halt?
  • Am I being naïve to think that the RAID card should have dealt with this?
  • How can I prevent a single misbehaving disk from impacting the entire array?
  • Am I missing something?
Brian Redbeard
  • 349
  • 3
  • 12
Stu Thompson
  • 3,339
  • 6
  • 30
  • 47
  • 11
    Another well-written question from you, +1. Always a pleasure to read (but unfortunately above my board to even have an idea about). – tombull89 Nov 16 '11 at 11:36
  • I have seen bizarre things like that. Had at least two cases where a faulty SATA harddisk prevented the system from even completing the POST stage when starting up, making it simply hang. So I don't find it inconceivable that one SATA disk (on a 3ware RAID controller) could cause all sorts of problems. An answer would obviously be to invest in "real" server hardware, if possible: SAS disks and a RAID controller worth the name. There most often is a reason for server hardware. Just the other day we retired three Dell servers at a customer site that ran 24/7 for 9 years without a single problem. – daff Nov 16 '11 at 12:41
  • 1
    @daff: By going budget on this setup we saved a solid 66% over a comparable build from HP. We put a three-year life span on this box; it doesn't need to last longer. Remember that this is a storage box--costs plummet year-on-year. – Stu Thompson Nov 16 '11 at 15:39
  • 2
    3Ware isn't bad, per se. I've had wonky behavior from a PERC card on a Dell system, which is supposed to be decent server hardware. The 3Ware card should have onboard battery and such, so I wouldn't feel too bad about the decision. Okay, you might get slammed for the SAS vs. SATA decision, but you aren't losing data and from your question you sound like you have backups and monitoring in place, so you're doing pretty good :-) – Bart Silverstrim Nov 16 '11 at 15:52
  • 1
    @StuThompson: of course it is cheaper to go budget and use consumer hardware, and most often it will perform fine, especially when, as in your case, there is a good HA concept behind it. But there are cases, as you have shown, where consumer hardware just doesn't cut it when bad things happen. I can pretty much guarantee you that a single faulty SAS disk on a good PERC (Dell) or SmartArray (HP) controller would not have caused you any issue other than a support call to get a replacement disk. We've had plenty of dead SAS disks over the years in production but never had them take a server down. – daff Nov 18 '11 at 16:26
  • 6
    Most SATA disks do not support TLER (Time Limited Error Recovery). When a typical SATA disk encounters a physical problem it sends a "hold on while I work on this" to the disk subsystem (which usually does as it's told). The disk will then proceed to spend 10-30 seconds (usually) on each error it finds until it hits an "I'm dead" threshold. SAS disks and SATA disks that support TLER are configured by their HBA to tell the disk subsystem "I've got a problem, what should I do?" so the HBA can decide the appropriate action basically immediately. (Simplified for brevity) – Chris S Nov 26 '11 at 02:34
  • 1
    @ChrisS: The drives I selected for the RAID setup, WD RE4 (WD2003FYYS), do support TLER. – Stu Thompson Nov 26 '11 at 07:17
  • 1
    Curious: You mention DRBD. Were the services connected to two different NAS units? Would a DRBD detach / disconnect have been successful? Maybe some way to connect performance monitoring with DRBD management would be nice. – korkman Feb 22 '12 at 16:30
  • @korkman ...Which is what effectively occurred manually. Automating such behavior is opening up a whole new can of worms. The time/effort costs would start to outweigh getting more 'correct' hardware, like SAS. – Stu Thompson Sep 20 '13 at 02:35

8 Answers

50

I hate to say "don't use SATA" in critical production environments, but I've seen this situation quite often. SATA drives are not generally meant for the duty cycle you describe, although you did spec drives specifically rated for 24x7 operation in your setup. My experience has been that SATA drives can fail in unpredictable ways, often affecting the entire storage array, even when using RAID 1+0, as you've done. Sometimes the drives fail in a manner that can stall the entire bus. One thing to note is whether you're using SAS expanders in your setup. That can make a difference in how the remaining disks are impacted by a drive failure.

But it may have made more sense to go with midline/nearline (7200 RPM) SAS drives versus SATA. There's a small price premium over SATA, but the drives will operate/fail more predictably. The error-correction and reporting in the SAS interface/protocol is more robust than the SATA set. So even with drives whose mechanics are the same, the SAS protocol difference may have prevented the pain you experienced during your drive failure.

ewwhite
  • 194,921
  • 91
  • 434
  • 799
  • As I was writing the question I just *knew* my choice of SATA was going to come up. :/ The IOPS and throughput are well within the capabilities of my setup. But I did not fully grok some of the more subtle differences. We put a 3-year lifespan on this box. Will be sure to use SAS next time around. – Stu Thompson Nov 16 '11 at 14:56
  • 1
    Yes, it's something to consider the next time. The nearline SAS drives I mentioned don't necessarily perform better than SATA, but it's things like error recovery and drive failures where the SAS is more manageable. I have a Sun Fire x4540 48-drive SATA storage system with 6 controllers, and individual drive failures tended to lock the server. Hard lesson. – ewwhite Nov 16 '11 at 15:34
  • 10
    A good buddy of mine is in the enterprise storage world. He read all this and says *"this guy is right. what happens is that SATA is designed to denote a complete failure and an intermittent one will requery the bus w/o enacting failover. typically this is never seen since most SATA configs are one drive"* – Stu Thompson Nov 16 '11 at 18:10
  • @StuThompson Have you since built a new box with near-line SAS? I'd love to read about your experiences. Your question has helped me a lot already, I will likely be building a similar box in the near future. – chrishiestand Sep 19 '13 at 21:48
  • 1
    @chrishiestand No, I haven't. I left the company in Jan '13; if I had stayed we would have built the replacement box with near-line SAS. Alas, the NAS's existence was too closely tied to my own and the data was moved to a service provider's SAN. – Stu Thompson Sep 20 '13 at 02:26
  • If you have to use SATA, choose from tested drives such as ReadyNAS[http://kb.netgear.com/app/answers/detail/a_id/20641] HCL and BackBlaze[https://www.backblaze.com/blog/best-hard-drive/] hardware lists. Then search for those drives at FreeNAS.org. Most importantly, stress test the drives extensively before production. Good choice in choosing Raid Edition drives. – rjt Mar 19 '15 at 04:07
  • Superbly good answer. I give major kudos to you. – rockower Nov 23 '18 at 17:01
16

How can a single disk bring down the array? The answer is that it shouldn't, but it kind of depends on what is causing the outage. If the disk were to die in a well-behaved way, it shouldn't take the array down. But it's possible that it's failing in an "edge case" way that the controller can't handle.

Are you naive to think this shouldn't happen? No, I don't think so. A hardware RAID card like that should have handled most issues.

How to prevent it? You can't anticipate weird edge cases like this. This is part of being a sysadmin...but you can work on recovery procedures to keep it from impacting your business. The only way to try to fix this right now is to either try another hardware card (probably not what you'd want to do) or change your drives to SAS drives instead of SATA to see if SAS is more robust. You can also contact the vendor of the RAID card, tell them what has happened, and see what they say; they are, after all, a company that is supposed to specialize in knowing the ins and outs of wonky drive electronics. They may have more technical advice on how the drives work, as well as on reliability...if you can get to the right people to talk to.

Have you missed something? If you want to verify that the drive is having an edge-case failure, pull it from the array. The array will be degraded but you shouldn't have more of the weird slowdowns and errors (aside from the degraded array status). You're saying that right now it seems to be working fine, but if it's having disk read errors, you should replace the drive while you can. Drives with high capacity can sometimes have URE errors (best reason not to run RAID 5, side note) that don't show up until another drive has failed. And if you're experiencing edge-case behavior from that one drive, you don't want corrupted data migrated to the other drives in the array.
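
(For what it's worth, on a 3ware card that kind of poking and pulling can be done from the shell with tw_cli. A sketch, under the assumption of a single controller c0 and the suspect drive on port 8; the exact removal syntax differs between tw_cli versions, so check the CLI's own help first.)

# list controllers, then the units and drive status on controller 0
tw_cli show
tw_cli /c0 show

# detailed status (reallocated sectors, SMART health, link state) for the suspect port
tw_cli /c0/p8 show all

# degrading the unit by removing the port is version-dependent; on 9.x CLIs it is
# roughly "tw_cli /c0/p8 remove" -- verify against your CLI's documentation first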

Bart Silverstrim
  • 31,092
  • 9
  • 65
  • 87
  • 1
    Yeah...we've already put in a new replacement policy like *"if the read errors fluctuate then yank it"*. Now that I think about it, we've had a fairly high rate of failure on these drives. 4 of 22 in 18 months. Hmmm.... – Stu Thompson Nov 16 '11 at 15:00
  • 2
    4 drives in 18 months? that's quite a rate there...while it could be the drives not being in spec, there could be a cooling/airflow issue too to look at. Or possibly something strange with the controller. Just some thoughts...keep an eye on the logs. If you're able to contact anyone in 3Ware with actual work on the cards and not just a script, you might want to run it by them and see what they say. – Bart Silverstrim Nov 16 '11 at 15:19
  • 1
    Depending on the set where you're seeing the errors, you could also check that there isn't something wonky or marginal with the cables too. If the errors seem to be concentrated on the same port, you might have less than a coincidental set of failures. – Bart Silverstrim Nov 16 '11 at 15:20
  • When I RMA'd the first failed drive some months ago, the vendor commented that he'd seen quite a few of this model/batch RMA'd. It made me nervous at the time and stuck with me. – Stu Thompson Nov 16 '11 at 15:27
  • There are two spare ports on the box. It might be wise to put the replacement into a free slot and leave port 8 alone. But then I won't know if it was the port or the drive...Hmmm...I'll probably err on the side of caution. Fail-overs during biz hours suck, to be avoided. – Stu Thompson Nov 16 '11 at 15:29
  • 4
    I've just seen in the SMART values that this bum drive was running at ~31°C, or a good 4°C higher than all the other drives. *Things that make you go hmmmm....* – Stu Thompson Nov 16 '11 at 17:41
  • More active than the others, or possible airflow issue, if more active could be due to resets and activity. Bad drive. Finish killing it with fire. – Bart Silverstrim Nov 16 '11 at 17:45
  • 1
    How tight is the tolerance on the temperature sensors? Could it just have been poorly calibrated? – Dan Is Fiddling By Firelight Nov 16 '11 at 17:59
  • 2
    @DanNeely: Out of 14 drives (11 data, 3 system) it was the only one with a higher temp. I'm fairly certain the airflow was good, but will explicitly check tomorrow. – Stu Thompson Nov 16 '11 at 18:07
10

I'm not an expert, but I'm going to take a wild shot in the dark on the basis of my experience with RAID controllers and storage arrays.

Disks fail in many different ways. Unfortunately, disks can fail, or be faulty, in ways that seriously affect their performance but that the RAID controller doesn't see as a failure.

If a disk fails in an obvious way, any RAID controller software should be pretty good at detecting the lack of response from the disk, removing it from the pool and firing any notifications. However, my guess as to what's happening here is that the disk is suffering an unusual failure which, for some reason, is not triggering a failure on the controller side. Therefore, when the controller is conducting a write flush or a read from the affected disk, it takes a long time to come back, which in turn hangs the whole IO operation and therefore the array. For whatever reason, this isn't enough for the RAID controller to go "ah, failed disk", probably because the data ends up coming back eventually.

My advice would be to replace the failing disk immediately. After that, I'd take a look at the configuration for your RAID card (it's 3ware--I thought they were pretty good) and find out what it considers a failed disk to be.
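
(On the 3ware CLI, the controller- and unit-level policies that govern this, such as auto-rebuild, auto-verify and the storsave setting, can be dumped for review; a sketch, assuming controller c0 and unit u0:)

# controller-wide settings and policies
tw_cli /c0 show all

# per-unit settings (rebuild/verify policy, cache mode, etc.)
tw_cli /c0/u0 show all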

P.S. Nice idea importing SMART values into Cacti.

growse
  • 7,830
  • 11
  • 72
  • 114
  • Once I connected the dots, the first thing I did was to remove the disk from the array; the hot spare filled in. That was last night. Today I pulled the disk and RMA'd it. The offending drive: http://geekomatic.ch/images/wd-re4-flux-read-error.jpg – Stu Thompson Nov 16 '11 at 15:19
  • This is one of the reasons I think every mission-critical system needs a card that does data scrubbing. I've seen this too many times to count, especially on SATA arrays; however, even higher-end SAS disks have been known to fail without triggering the controller. – Jens Ehrich Dec 09 '16 at 17:35
6

Just a guess: the harddisks are configured to retry on read errors rather than report an error. While this is desirable behaviour in a desktop setting, it is counterproductive in a RAID (where the controller should rewrite any sector that fails reading from the other disks, so the drive can remap it).
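
(This is the TLER/ERC behaviour discussed in the comments on the question. On drives that support it, the error-recovery timeout can be inspected and capped via SCT commands; a sketch with smartctl, where the device names are examples and not every drive accepts the command:)

# read the current SCT Error Recovery Control timeouts (read, write)
smartctl -l scterc /dev/sda

# cap error recovery at 7.0 seconds for reads and writes (values are in 100 ms units)
smartctl -l scterc,70,70 /dev/sda

# the same drive addressed through the 3ware controller
smartctl -l scterc,70,70 -d 3ware,8 /dev/twa0

# note: on many drives this setting does not survive a power cycle, so it has to be
# reapplied at boot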

Simon Richter
  • 3,209
  • 17
  • 17
6

my shot in the dark:

  • drive 7 is failing. it has some failure windows where it's not available.

  • drive 8 has some 'lighter' errors too; corrected by retrying.

  • RAID10 is usually "a RAID0 of several RAID1 pairs", are drive 7 and 8 members of the same pair?

if so, then it seems you hit the "shouldn't happen" case of two-disk failure on the same pair. almost the only thing that can kill a RAID10. unfortunately, it can happen if all your drives are from the same shipping lot, so they're slightly more likely to die simultaneously.

I guess that during a drive 7 failure, the controller redirected all reads to drive 8, so any error-retry caused big delays that caused an avalanche of frozen tasks, killing performance for a while.

you're lucky that drive 8 doesn't seem to be dead yet, so you should be able to fix without dataloss.

I'd start by changing both drives, and don't forget to check cabling. a loose connection could cause this, and if not routed firmly, it's more likely to happen in adjacent drives. also, some multiport cards have several two-port connectors, if drive 7 and drive 8 are on the same one, it might be the source of your trouble.
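
(to answer the pairing question for yourself: the 3ware CLI shows the unit topology, i.e. which ports make up each RAID-1 leg of the RAID-10; a sketch, with c0/u0 as example controller and unit:)

# list the unit's subunits and the port each member disk sits on
tw_cli /c0/u0 show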

Javier
  • 9,078
  • 2
  • 23
  • 24
  • 3
    Drive 8 is what caused the service interruption; I've already pulled it. Drive 7, while it has lost some sectors, has been in this state for a while and is still generally performing well. No, the drives are in different pairs. *(It was something I considered, along with a possible misalignment of my Cacti/SNMP queries.)* The card has 16 ports, 4 cables, 4 ports per cable into a backplane. If the issue is the card, cable or backplane I'll know soon enough when I insert drive 8's replacement. – Stu Thompson Nov 16 '11 at 15:23
6

You need the features of enterprise-class storage devices. Specifically, the WD RE4 enterprise drives have two features needed to prevent this behavior in RAID arrays. The first technology listed below prevents rotational harmonic vibration from causing needless wear on the hard drive's mechanical components. The second is the one whose absence caused your problem: the plain SATA protocol does not provide this feature. To get these features you need SAS, and if you insist on SATA drives you can purchase SAS-to-SATA interposer cards such as the LSISS9252.

Enhanced RAFF technology: Sophisticated electronics monitor the drive and correct both linear and rotational vibration in real time. The result is a significant performance improvement in high-vibration environments over the previous generation of drives.

RAID-specific, time-limited error recovery (TLER): Prevents drive fallout caused by the extended hard-drive error-recovery processes common to desktop drives.

http://en.wikipedia.org/wiki/Error_recovery_control#Overview

Also please see link below:

http://en.wikipedia.org/wiki/Error_recovery_control#Raid_Controllers

Also see this Western Digital TLER document explaining the error-recovery process in depth: Error Recovery Fallout Prevention in WD Caviar RAID Edition Serial ATA Hard Drives:

http://www.3dfxzone.it/public/files/2579-001098.pdf

Mark Henderson
  • 68,316
  • 31
  • 175
  • 255
Loose Cannon
  • 111
  • 2
  • 4
3

SATA Interposer Cards are another solution.

I recently experienced exactly the same fate and found this thread. The overall tenor is that the SAS protocol is better suited for RAID than SATA, because SATA is lacking features. This is why the same physical drives are equipped with SAS controllers and then sold as Nearline SAS.

Searching further, I found:

http://www.lsi.com/products/storagecomponents/Pages/LSISS9252.aspx

I'm investigating upgrading one of my storage arrays with a batch of these. Right now, the price difference between 3 TB SATA and SAS is 400% (vanilla price, same brand, specs and shop, Germany). I obviously can't tell if this strategy will work out well, but it's worth a try.

Comments very welcome :-)

korkman
  • 1,647
  • 2
  • 13
  • 26
  • 1
    Well, nice theory. After gathering some information: only storage tray manufacturers can integrate these boards, and adding them doesn't necessarily mean better error handling. – korkman Mar 02 '12 at 13:28
2

I have seen a SATA disk with broken electronics solidly lock up the firmware init of an Areca 12-something; there was no way to access the BIOS, let alone boot the machine from any medium, until the offending hard drive was found by pulling disks out in a binary-search fashion.

rackandboneman
  • 2,487
  • 10
  • 8