9

I have a server with 3 hard drives installed and a total capacity of 6 drive bays. We're planning to max it out, but our consultant also suggested getting a second RAID controller "for redundancy" to support the new drives. To me, this doesn't make much sense. Even with a second RAID controller running half of the disks, we're still stuck with only half of our disks/programs/data if one of the controllers dies (which isn't much better than running with none). We're putting VMware on the server and he vaguely mentioned some advanced fault tolerance/failover features, but if the disks are inaccessible due to a failed controller, how is that supposed to work?

Considering only redundancy, not performance, why would I want a second RAID controller in my server?

Bigbio2002
  • I have seen a case where the only RAID controller failed, making the large multi-disk RAID storage it had been servicing not just unusable but all of its data unrecoverable. It was a heavy blow to the company. Ultimately most of the data was reconstructed from files found on workstations. A total shame. Always mirror data onto an independent disk cluster with, obviously, another controller. Never assume RAID 6 will save your life in all cases if you rely on a single small card that runs at 80 °C for many years, 24/7. – h22 Oct 21 '16 at 07:02

4 Answers

11

In a 'single box high availability' design then yes, you'd want a second controller, ideally on a second bus too. But this kind of approach has given way to cheaper designs based around clustering, where one box failing doesn't stop the service. So it depends on whether you plan to use a clustered environment or rely on a single box. Even if your answer is the latter, dual controllers may be seen as adding extra complexity and may well be overkill.

Edit: based on your comment about using ESXi on your other question, I'd have to say that its clustering is fabulous; we have many 32-way clusters that work brilliantly.

Chopper3
  • AFAIK, we're not going to use clustering. How would a second controller in a single box benefit me? Is there such a thing as controller failover? – Bigbio2002 Aug 22 '11 at 20:07
  • Not in an ESX/ESXi world, no; a single controller would be fine. Make sure you get a controller that will make one big RAID 10 array of all 6 disks but lets you create 2 TB (or smaller) logical disks. HP's Pxxx-series lets you do that, btw. – Chopper3 Aug 22 '11 at 20:20
7

A second RAID controller that is actively used does not give you redundancy. It only does if it is a cold-standby controller to which you switch all your disks when the first one dies; then you have redundancy (for the controller). But beware of doing so, as posted here.

So the RAID gives you redundancy for the disks but leaves a single point of failure at the controller. Having a second (unused) controller may solve this, as you could switch all the disks over to the new one. Whether this works depends on other factors...

I'm not a native speaker, but to me "fault tolerance" is something different from "redundancy". Can some English speaker help me out here?

mailq
  • Redundancy is a way of achieving fault-tolerance :). I was looking for something along the lines of a cold-standby or a failover controller. Is this a feature that's supported, or would I have to manually swap out the cards? – Bigbio2002 Aug 22 '11 at 19:00
  • I have never seen a controller where the switching of disks is done automatically. That is either because I never looked for it or because I can't imagine how you would cable one disk to two controllers. – mailq Aug 22 '11 at 23:57
  • Dual-ported drives are quite common in enterprise environments (think SAN shelves) - but the prices go up by a factor of 2 or 3, obviously. – adaptr Feb 16 '12 at 08:10
3

On a single box, you actually need two RAID controllers, connected to two different PCI-E root complexes, to have complete I/O subsystem redundancy. This can be achieved with two different configurations:

  • use costly dual-ported SAS disks, with each SAS link connected to a different controller. In this manner, each controller is connected to every disk. Obviously, the two controllers can't operate on the disks at the same time; some form of locking/fencing is necessary to coordinate access to the disks. SCSI has special provisions for the necessary fencing mechanism, but these must be coordinated by appropriate software. In other words, you cannot simply connect a disk to two controllers and call it a day; rather, you need the appropriate software configuration to make it work without problems;
  • use normal, cheaper single-link SAS/SATA disks, connecting half of them to each controller. For example, with 6 disks you connect 3 disks to one controller and 3 to the other. On each controller, configure a RAID array as needed (e.g. RAID 5 or RAID 1). Then, at the OS level, you can configure a software RAID 1 between the two hardware arrays, achieving full array redundancy (see the sketch below). While cheaper, this solution has the added drawback of effectively halving your usable capacity (due to the software RAID 1 layer).
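
For illustration, here is a minimal sketch of the second approach using Linux mdadm. It assumes a Linux host (ESXi itself does not do this kind of host-level software RAID) and that each controller presents its 3-disk array to the OS as a single logical disk; the device names /dev/sda and /dev/sdb are placeholders, so check yours with lsblk:

    # Mirror the two hardware arrays with Linux software RAID 1 (mdadm).
    # /dev/sda and /dev/sdb are placeholders for the logical disks exposed
    # by the two RAID controllers; adjust to your system.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

    # Record the array so it is assembled at boot (config path varies by
    # distribution), then use /dev/md0 like any other disk:
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf
    mkfs.ext4 /dev/md0

If one controller (or more of its disks than its hardware RAID level tolerates) fails, the md mirror keeps running on the surviving half.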

A key problem with both approaches is that you still do not have full system redundancy: a motherboard/CPU problem can bring down the entire system, regardless of how many controllers/disks you have.

For this reason, this kind of redundancy-in-a-box is seldom used lately (apart from mid/high-end SAN deployments); rather, clustering/network mirroring is gaining wide traction. With clustering (or network mirroring) you have full system redundancy, as a single failed system cannot take down data access. Obviously clustering has its own pitfalls, so it is not a silver bullet, but in some situations its advantages cannot be denied. Moreover, you can also use asynchronous network mirroring to have almost-realtime data redundancy in a geographically different location, so that a single catastrophic event will not wreak havoc on your data.
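
To make the network-mirroring option concrete, here is a minimal DRBD sketch (DRBD and its replication protocols come up again in the comments below). The hostnames, IP addresses and backing device are placeholders, and protocol C gives fully synchronous replication:

    # Write a minimal DRBD resource definition (classic DRBD 8 syntax).
    # "alpha"/"beta", the addresses and /dev/sdb1 are placeholders.
    cat > /etc/drbd.d/r0.res <<'EOF'
    resource r0 {
        protocol C;                      # fully synchronous replication
        on alpha {
            device    /dev/drbd0;
            disk      /dev/sdb1;         # local backing device
            address   192.168.0.1:7789;
            meta-disk internal;
        }
        on beta {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   192.168.0.2:7789;
            meta-disk internal;
        }
    }
    EOF

    drbdadm create-md r0          # initialise metadata (run on both hosts)
    drbdadm up r0                 # bring the resource up (run on both hosts)
    drbdadm primary --force r0    # on ONE host only, for the initial sync

The replicated device then appears as /dev/drbd0 and can be used like a local disk, with every acknowledged write also committed on the peer.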

shodanshok
  • With some kinds of data, a copy that is only half updated (because the synchronization failed midway) may be unusable. A database is the typical example, but so are various source code trees and data sets with lots of small files that closely depend on each other. – h22 Oct 22 '16 at 17:19
  • It depends on the underlying replication mechanism. DRBD, for example, enables full (protocol C) or near-full (protocol B) synchronous replication. This means that when a write is acknowledged on the source host, it has actually been committed on the remote host as well (in other words, write barriers are honored on both hosts). With such a guarantee, any robust filesystem/database should recover without problems. – shodanshok Oct 23 '16 at 20:27
  • Yes, some databases support replication, as do some other applications. These are obviously much easier to work with. – h22 Oct 24 '16 at 08:16
1

You'd need dual-ported SAS drives to provide actual failover across multiple controllers. While these do exist, they are decidedly not cheap, and not in the price range of a single server that only has internal storage.

These are technologies often employed in SAN systems, where controller death is a real issue.

For a single server with no other failover capabilities, a second controller will not gain you anything; it will just cost more money and provide the consultant with more profit.

adaptr