14

In our shop we faithfully use RAID on all our workstations, probably just because that seems to be the way it ought to be done. I'm talking about workstations for scientific simulations, using the onboard RAID chips.

But I've heard a lot of RAID horror stories. Stack Overflow itself has had an outage caused indirectly by a RAID controller.

RAID protects you against a very narrow type of failure - physical disk failure - but at the same time it introduces extra points of failure. There can be problems with the RAID controller, and there often are. In our shop at least, it seems that RAID controllers fail at least as often as the disks themselves. You can also easily mess something up in the process of swapping a faulty drive.

When is RAID worth the trouble? Don't you get a better return on investment by adding more redundancy to your backup solutions? Which type of RAID is better or worse in this regard?

Edit: I've changed the title from the original "Is RAID worth the trouble?" so that it sounds less negative.

Chealion
  • 5,713
  • 27
  • 29
amarillion
  • 1,409
  • 2
  • 16
  • 25
  • 3
    When you say using RAID on workstations, I'm wondering what you mean by RAID. The RAID that ships as part of a desktop-class motherboard's chipset is not really RAID. Real RAID is an expensive (several hundred, maybe thousands of dollars) option, usually implemented as a PCI card of some type. Think Adaptec or LSI, not Promise. – Jason Tan May 30 '09 at 18:53
  • 1
    You're right, we're using some on-board chipset solution. So perhaps my question should be modified a little: is cheap RAID worth the trouble? – amarillion May 30 '09 at 19:51
  • See Also: [What are the widely-used RAID levels, and when should I consider them](http://serverfault.com/questions/339128/what-are-the-different-widely-used-raid-levels-and-when-should-i-consider-them) – voretaq7 Dec 16 '12 at 18:45

21 Answers

17

Don't worry, RAID isn't used throughout the business world because of groupthink! The chance of a decent RAID controller failing is far, far lower than the chance of a disk failure. I don't recall ever seeing a RAID controller fail in real life, while I've seen many a disk die, both in the office and in the datacenter.

PS: I see your tags. RAID is not backup! :)

Alex J
  • 2,804
  • 2
  • 21
  • 24
  • 1
    Right, it's not backup. So then it's redundancy? So it's really all about high uptime? Unless you need five nines, you don't really need RAID? – amarillion May 30 '09 at 19:47
  • 6
    No, it's about availability. Taking down the machine when you want to is fine. Having a single hard drive decide to take down your machine isn't. Using RAID properly prevents that from happening. – Matt Simmons May 30 '09 at 20:13
  • 9
    @amarillion. Wow, that's a dangerous sentiment. How much experience with hard-drives do you have? RAID is pretty much required for even *2* nines of reliability (more so the more hard-drives are in the mix), and RAID alone definitely will not get you to 5 nines, you'll need redundant datacenters for that, at least. Even then it's a crapshoot, 5 nines is management fantasy land BS, that's less than an hour of downtime per decade (~5 min/year). Not even IP backbones have that. – Wedge May 30 '09 at 20:32
  • Alright, I was just exaggerating with the five nines. My point is, in our case we're probably using RAID for the wrong reason. – amarillion May 30 '09 at 21:03
  • 4
    @amarillion: Some of my customers have developers on site billing $200/hr. Or workers responding to life-or-death situations. Disrupting those workers for want of an $80 hard disk seems kinda dumb to me, YMMV. – duffbeer703 May 31 '09 at 01:23
  • What are you talking about? I have a RAID Linux machine that IS my BACKUP for my Mac. I think you're getting a little too 'clever' with your technicalities, Alex. – Brock Woolf May 31 '09 at 01:49
  • 3
    No. RAID protects you from hard drive failure. It does not protect you from 'rm -rf /'. THAT is what backups are for! – Alex J May 31 '09 at 03:48
9

ZFS by Sun (also part of OpenSolaris; read-only on Apple's OS X currently) not only does RAID at various levels but also checks that the data written to disk is actually there. Consistency is key! RAID is useless if you can't rely on its integrity. Pick a decent RAID controller (I prefer HP's) and scrub your RAID periodically to find errors.

Software RAID (such as ZFS), on the other hand, makes you more hardware-independent if the RAID controller dies and you can't get an exact replacement.
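For illustration, a periodic scrub on ZFS looks roughly like this (a sketch; the pool name "tank" and the device names are made up):

    # Create a mirrored pool from two disks (device names are examples)
    zpool create tank mirror c0t0d0 c0t1d0

    # Walk the whole pool and verify every block against its checksum
    zpool scrub tank

    # Check scrub progress and any checksum errors found
    zpool status tank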

lepole
  • 1,723
  • 1
  • 10
  • 17
8

Always. Disks are cheap, your information is not. But use software RAID, so you have the flexibility to move forward or change hardware later on (trust me, you will need it). And also use a checksumming filesystem like ZFS, to protect against silent data corruption (which is very likely with large disks nowadays).

Rudd-O
  • 91
  • 1
8

For those of you saying you won't use hardware RAID because if the controller fails and you can't get an identical replacement you're screwed: you're going about it the wrong way.

  1. If uptime is that critical to you, you should NOT be buying cheap hardware. As was said before, use a good RAID controller: HP, LSI, Dell, etc.

  2. If the controller was purchased from the computer manufacturer (i.e. a Dell server with a Dell RAID controller), Dell will tell you how long they will be stocking those parts; usually this is 4+ years from the EOL of that server.

  3. If getting someone running again quickly means you cannot wait for the delivery, then you should purchase a second spare controller for yourself, regardless of who made it.

  4. If you set up a RAID 1, you can sometimes take one of those drives and drop it on a normal controller to recover the data. If that is important to you, confirm/test this with your controller before you are in a critical situation (see the sketch below).
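A minimal sketch of the test from point 4, assuming a Linux machine and a hypothetical /dev/sdb1 for the mirror member:

    # Attach one RAID 1 member to a plain (non-RAID) controller, then
    # try mounting it read-only
    mount -o ro /dev/sdb1 /mnt/test

    # If the filesystem mounts and files read back cleanly, this
    # controller's RAID 1 layout can be recovered without the card
    ls /mnt/test

Whether this works depends on where the controller stores its metadata, which is exactly why it needs testing before an emergency.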

Hardware RAID saved my butt twice. Once, in an email server, one of the drives failed; I got the email alert from the RAID monitoring software on that machine, called up Dell, and had a new drive the next day. Popped it in and it rebuilt all on its own. ZERO downtime on that one.

The second time, a drive failed in an old file server that was scheduled for replacement in 6 months. The controller kept it running and we moved the replacement of the server up to that week. Saved buying a new drive (since it was out of warranty) and again ZERO downtime.

I've used software RAID before and it just doesn't recover as nicely as a hardware-based one. You have to test your setup, software or hardware, to be sure it works, and know what to do when the brown stuff hits the fan.

LEAT
  • 217
  • 1
  • 2
  • 3
    People tend to look at RAID as a type of insurance. If they never have an "accident", the benefits of RAID (insurance) never seem apparent. Thanks for sharing your story, as many people (I think) take RAID lightly: if they never have a bad experience, why invest against something that may not happen? This should be a lesson for everyone reading: a solid hardware RAID controller will save your ass in that one-in-a-million/billion chance. Don't leave it to chance; always use a good hardware RAID controller, especially for servers. – osij2is Jul 24 '09 at 16:09
6

Hard drive failures are much more likely to happen in a server than in a desktop workstation...

You can't just say "adding more points of failure" without taking into account the likelihood of each failure. Especially since these less likely points of failure are specifically in place to subvert the more likely hard disk drive crash. As you've put it, you've basically created a Pascal's Wager-like fallacy.

Most RAID systems on desktop motherboards are cheapo software/hardware hybrids with most of the work done in the software driver. IMHO they are pieces of crap used to sell to power users.

On the other hand, a good actual hardware RAID is quite reliable, and it has the hardware to do its thing without (despite?) the operating system. But those get expensive, because real hardware usually has battery backup, dedicated XOR hardware to calculate parity, etc. Even more expensive if it's done using SCSI.

Summary: If you are running the motherboard based RAID systems, then no, it isn't worth the trouble.

Peter Mortensen
  • 2,319
  • 5
  • 23
  • 24
Ape-inago
  • 271
  • 1
  • 3
  • 8
  • 3
    A colleague runs a large school IT environment with 180,000 workstations with a top-notch helpdesk. 7% of their desktops require a hardware replacement within their 5 year lifecycle, and 85% of those replacements are hard disks. – duffbeer703 May 31 '09 at 01:32
  • Yeah, but if a workstation goes down, you just have the user log into another machine while you are fixing the broken one. With that many workstations, there ought to be a central file repo. I wonder what the statistics would look like with 180,000 servers. – Ape-inago May 31 '09 at 02:40
  • 1
    You're right for many circumstances -- but not for everyone. In my friend's scenario, many of those PCs are in the back of classrooms, and if they are broken, that class doesn't have a computer and it's a big deal. At my job, we have spare workstations and don't really care. – duffbeer703 May 31 '09 at 16:11
5

Although backups and RAID are solutions to different problems, most "RAID problems" are very similar to the most common backup problem (i.e. nobody tests a restore) -- nobody tests system recovery. Other RAID problems are often a direct result of people not understanding what it does and doesn't do. For example, many people think that RAID guarantees the integrity of their data -- it does not.

For workstations, if you're using RAID 0 to improve performance of IO-bound applications, or RAID 1/5/6 to keep the $100/hour scientist working when her $80 hard disk fails, you're using RAID appropriately. Just don't confuse disk redundancy with backup, and have tested procedures in place to ensure that your IT guys can handle recovery.
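One way to rehearse that recovery, sketched here with Linux software RAID (mdadm) since hardware controllers each have their own tools; /dev/md0 and /dev/sdb1 are example names:

    # Simulate a disk failure on a mirror and practice the swap
    mdadm --manage /dev/md0 --fail /dev/sdb1    # mark a member as failed
    mdadm --manage /dev/md0 --remove /dev/sdb1  # pull it out of the array
    mdadm --manage /dev/md0 --add /dev/sdb1    # re-add it, triggering a rebuild
    cat /proc/mdstat                           # watch the resync complete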

duffbeer703
  • 20,077
  • 4
  • 30
  • 39
  • Good note for workstations. Workstation needs are completely different than server needs. And an *emphatic* yes on "..don't confuse disk redundancy with backup". – osij2is Jul 24 '09 at 16:11
4

There are two types of RAID:

  • The cheap, integrated kind. This is NOT real RAID; the real work is done by software (a special driver that does the RAID computations). You should avoid this one.
  • The other kind is expensive, but what you get is real RAID. If you can afford it, it's worth the money.

Some operating systems have good software RAID solutions (this has nothing to do with the crappy cards mentioned above). Linux software RAID is especially good; its performance is really good.
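For example, a basic Linux software RAID 1 setup is a couple of commands (device names are examples; adjust for your disks):

    # Build a two-disk mirror as /dev/md0
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

    # The array is then an ordinary block device
    mkfs.ext3 /dev/md0
    cat /proc/mdstat    # watch the initial sync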

RAID can only improve reliability; it is not a backup solution. Files can be deleted accidentally, and a faulty disk can return (and propagate) bad data to other disks in a RAID array, so a real backup solution is still needed.

cstamas
  • 6,607
  • 24
  • 42
4

RAID is great for uptime, but it's not a substitute for backup. As a colleague once commented, "You know that 'Oh, sh!t' moment when you deleted something accidentally? RAID just means you get to 'Oh, sh!t' more than one drive at the same time."

That said, that day when you pop your head into your boss's office and tell her, "By the way, the database server had a hard drive crash last night-- we never went down, it finished rebuilding onto the spare at 5 AM and I've sent the bad drive off under warranty" -- that's when RAID is priceless.

user6622
  • 186
  • 1
  • 1
2

What is your failure rate on hard disks and RAID controllers? Failure rates for RAID controllers should be far lower than for disks. If you have a high failure rate, you may want to look at your environment for things like static discharge that could be causing issues.

For workstations you may want to use software RAID, as suggested by Alakdae, because you won't have to worry about maintaining stocks of the precise hardware controller. However, all vital information should be stored on your servers, which do have hardware RAID and are backed up to different media.

Server hardware manufacturers do maintain RAID controllers, so even if it's an older controller you can usually still get it from them if you need to (it'll cost you a pretty penny, though).

David Yu
  • 1,032
  • 7
  • 14
2

Linux software RAID is excellent, and it actually beats low-end hardware RAID hands down. It also has a few optimizations that can be useful for a workstation. For example, a RAID 1 can read different things from each disk at the same time, effectively doubling random-access read throughput, which is a common use case, unlike the transfer-rate-bound operations optimized by RAID 0.

As for reliability, it's a very well-maintained part of the Linux kernel, used by millions, and it handles hardware failures very well, so it's clearly a win as far as availability is concerned. I have used it on my personal workstations as well as a few dozen low-end servers for years, some pretty loaded, and never could attribute a fault to it. I've experienced a good dozen broken disks in the meantime, however.

(Higher end hardware RAID cards have other features though, such as battery-backed write cache. It basically multiplies random synchronized disk write speed by ten. It is absolutely necessary for databases, probably pretty useless for workstations.)

Peter Mortensen
  • 2,319
  • 5
  • 23
  • 24
niXar
  • 2,023
  • 17
  • 23
2

It seems that a lot of the above posts are forgetting the original question and are just debating RAID 1. The question was "When is RAID worth the trouble?" Well, it depends... If your developers do a lot of data reads and writes on their workstations, then a RAID 0 configuration would be worth it. Adding more drives to this RAID 0 is of course going to boost speed and performance BUT will increase the likelihood of a failure (disk or controller).

I work for a Nursing School with about 500 Dell machines deployed, and almost none of them utilize any sort of RAID. It seems to me that my type of users won't see enough of a benefit to add the complexity of a RAID system on each machine. I worry more about data recovery and disk imaging than the speed of RAID 0 or the redundancy of RAID 1. Of course, I'm not talking about our production servers; that's another story. Data recovery being crucial, we rely on other backup methods to account for more than just disk redundancy. Any sort of RAID won't help you if a user accidentally deletes a file.

So to answer your question, IMHO... RAID 0 on a workstation is worth it when the user needs the performance. (Just make sure that all important data is backed up.) I'm sure you can check the data throughput on the existing setup to see if it's adequate. RAID 1 should be used in the server environment, where higher-class RAID controllers are available. It's not worth the hassle on a workstation because it complicates deployment, disk imaging, and repairs. Many of these workstations come with RAID controllers built into the motherboard. It's a good feeling to know that if a motherboard goes out on a machine, I can always put the drive in another system to get the data.

Treybus
  • 31
  • 1
1

I just had the RAID controllers in two (identical) servers fail; in the time since we got those two machines, we haven't had one hard disk failure in the entire company.

I think RAID on the desktop is a bad idea; the cheap RAID controllers you're going to put in those machines will fail long before the actual hard drive.

On servers, maybe. I'm not going to trust RAID controllers again; make sure you have a spare machine and good backups.

Nir
  • 121
  • 6
  • 12
1

I am a developer and all our workstations use RAID for the internal drives. RAID 0. This is definitely worth it. You never want to go back to compiling from a single 7200 RPM drive once you have tried a pair of 15,000s.

I have been challenged on whether it is the RAID or the 15k drive that is making compile times shorter. I don't know; for compiling, a single fast drive may give exactly the same performance. However, a single SAS drive is not particularly large for a modern PC, so inexpensive onboard RAID still has a place. That, and I doubt RAID is ever going to hurt the performance of the system.

I think this sort of RAID is certainly appropriate for a workstation and is probably best done using the inexpensive onboard controllers. On the server side, most of our servers have some form of RAID array for the OS disk, and data is then on a separate array of some appropriate form. I don't know about our production servers, but our dev servers (of which we have a fair number) have never had a controller fail; we have had drives fail, though. In one case we had half of the OS array fail on a SQL box, and while it was rebuilding, the other disk failed! Sometimes RAID 1 just ain't enough!

pipTheGeek
  • 1,152
  • 5
  • 7
  • 1
    I have to call BS on this one. RAID 0 is useless for a developer workstation. RAID 0 at best doubles transfer rates; it does nothing for random access. Guess what developers do ... read and write lots of tiny files, and the occasional large-ish one. The only workstation where it would be useful would be that of a graphic designer doing video editing, where you need all the GB/s you can get. – niXar May 31 '09 at 12:35
  • This may be true; I haven't compared the performance of a single 15k SAS drive to that of the dual-drive RAID 0. I have updated my answer. – pipTheGeek May 31 '09 at 15:41
  • 1
    It depends on what your developers do. We have guys that work with big datasets who notice a significant performance improvement, especially during compiles. GIS guys notice an improvement with RAID 0 too. – duffbeer703 May 31 '09 at 16:06
  • Going from a 7.2k to a 15k drive would mean a substantial speedup. There's not a lot more to be gained from RAID 0. – Loren Pechtel Sep 17 '09 at 01:34
  • Surely a single SSD would be cheaper and faster nowadays? – Dentrasi Nov 10 '10 at 01:32
1

For your scientific workstations it may well be worth it, IF those systems work better with their data stored locally as opposed to on a share on a file server. For the general populace, however, I'd say no. It's not worth the hassle and headache when all you really need is to restore data that should be kept on shares.

Shawn Anderson
  • 542
  • 7
  • 14
1

RAID is only useful when you absolutely, positively can't have the server go down unexpectedly. We use RAID on all the servers in our datacentre where there isn't some other form of redundancy. For example, we don't use RAID on our webservers, because there are another 10 still working.

The litmus test is: "if a disk breaks in the middle of the night and it can't wait until 9am, it needs RAID".

David Pashley
  • 23,151
  • 2
  • 41
  • 71
  • There are other contexts where it makes sense - like if you don't have a quick & easy way to restore the machine to its former state. – cp.engr Jan 08 '17 at 04:53
1

RAID is worth the trouble when you have a battery-backed controller.

For server applications which frequently fdatasync() log files for durability (which is not uncommon in databases), you'll end up writing the same blocks over and over again. This will kill IO performance if you don't have a battery-backed controller.

If you DO have a battery-backed controller, many of the writes won't even reach the disks, instead just staying in memory until they're replaced by another write. This is a Good Thing.
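You can see the effect with a crude synchronous-write benchmark; oflag=dsync forces every block to be acknowledged before the next is written, mimicking a log-flushing workload (file name and sizes are arbitrary):

    # Compare the reported MB/s with and without a battery-backed cache
    dd if=/dev/zero of=/tmp/synctest bs=4k count=1000 oflag=dsync
    rm /tmp/synctest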

The redundancy is a bonus but not essential, as important things should be redundant at a system level.

MarkR
  • 2,898
  • 16
  • 13
1

Cheap RAID implementations are terrible.

Your choices are, in order of reliability:

1) HP DL servers with their hardware RAID.
2) 3Ware RAID cards.
3) ZFS
4) Linux software RAID

Anything else is asking for trouble, and indeed may result in lower overall reliability than a non-RAID solution.

Consider what to do if your controller fails and the manufacturer is out of business.

Consider whether you can recover from an apparent double-disk failure caused by power/cabling issues.

Those are two examples among hundreds.

carlito
  • 2,489
  • 18
  • 12
1

For workstations, RAID is probably not worth it compared to having a new system onto which data can be restored...

Many were talking about RAID 0... that's not there to help availability. You're doubling the chances of the volume failing, since once one drive dies you lose the whole thing. RAID 0 is just about playing with read/write speed on a volume and getting more storage. The only way it could help in a business environment is to take two RAID 0 arrays and mirror them (RAID 0+1).

RAID is not a backup solution, as has been pointed out.

RAID is also not perfect. I think this post from this guy's blog kind of sums up how I feel about RAID and when it's worth it: Thinking of RAID?

On a workstation you should be able to get the one person to use another system while a replacement is rolled out. Why use RAID? His or her data should be stored on the server, where management, data integrity, and backups are centralized. The workstation should be configured so that it can be periodically upgraded or altered as finances allow, and the RAID is just another layer of cost and headache to manage (plus power use and heating issues with added drives and airflow restrictions).

In the majority of cases for businesses, it's probably far more cost-effective to put the money for a RAID card into a bigger drive. And if you're using onboard RAID, you're still going to have issues, since it tends to tie the RAID format to the motherboard (and it's not true RAID anyway... it turns up in Google searches as "fake RAID"). Unless you get a very similar motherboard to replace one when it goes bad, you may not be able to get back into your RAID volume!
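On Linux you can at least see whether you're dealing with this kind of "fake RAID", since the metadata the BIOS writes to the disks is readable by the dmraid tool (a sketch; output format varies by vendor):

    # List any BIOS/fake-RAID metadata found on the attached disks
    dmraid -r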

Bart Silverstrim
  • 31,092
  • 9
  • 65
  • 87
0

Why bother on a workstation? Surely you have all your home directories and data stored centrally. That is where you want to use RAID.

goo
  • 2,838
  • 18
  • 15
0

If you worry about a drive controller failing, then you also need to consider the server failing - fans, motherboard, RAM, network.. and then you also need to consider the router failing, and the cabling, and the power... and you also need to consider the datacentre failing (flood, fire, human error), and then you need to consider the external network failing (cables cut - all the time in some places!).

In short, you can worry about site downtime so much you'd never bother putting anything online at all! Or you could factor the risk of failure against the cost of redundancy and get a much more realistic approach. And of all the things I listed, the hard drive is the single most likely point of failure.

Next to human error, that is. Who types "shutdown -h now" when they want to reboot... :(

gbjbaanb
  • 3,852
  • 1
  • 22
  • 27
0

My big worry is disks, as it seems that you can't buy the cheapies:

A major vendor notes:

'Most RAID controllers are designed to timeout a given command if the disk drive becomes unresponsive within a given time frame. The result will be that the drive will appear off line or will be marked bad and an alert will be given to the customer. Enterprise class drives (or drives designed for RAID environments), have a retry limit before a sector is marked bad. This retry limit enables the drive to respond to the RAID controller within the expected time frame. While desktop drives may work with a RAID controller, the array will progressively go off line as the disk drive ages and may result in data loss.'

That seems insane to me, another gotcha that ensures the disk vendors will get lots of returns from people who 'don't know better'. However, I read that Google did a whitepaper (can't find it, though) showing there is no difference in drive reliability between the two 'classes' offered by the storage vendors. I doubt Google uses hardware RAID controllers in their beige-box fleet, though.

Perhaps mdadm (Linux software RAID) has settings one can use to deal with the more impatient timeouts in desktop drive firmware?
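Not in mdadm itself as far as I know, but two knobs are commonly suggested; whether they work depends on the drive firmware and kernel, so treat this as a sketch:

    # Some desktop drives support capping error recovery time (SCT ERC);
    # 70 = 7.0 seconds for reads and writes
    smartctl -l scterc,70,70 /dev/sda

    # Failing that, raise the kernel's command timeout (default 30s) so
    # md waits out a long internal retry instead of kicking the drive
    echo 180 > /sys/block/sda/device/timeout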

Perhaps in reality, everyone is paying for their warranty through a knobbled 'time-out' period in the controller firmware?

Pluto
  • 1