
I'm about to set up a Linux cluster of 5 physical server nodes (more nodes to be added later, probably).

  • the cluster will be managed by Proxmox (and yes, it works with software RAID)
  • shared storage will be implemented with Gluster in a redundant setup, with each physical server holding a brick (so data will be redundantly available from all machines; see the sketch after this list)
  • Percona XtraDB Cluster will be used as the main, multi-master database - again with data shared by all physical machines
  • each machine will have two HDDs of about 2-3 TB each, in a RAID1 setup
  • all machines will be hosted in a large datacenter with redundant power supply etc.
  • server specs can be seen here
  • the purpose of the whole cluster is to distribute workload and allow high availability. A machine can go down at any time without being a problem for the whole system.
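
To make the storage layout concrete, here is a rough sketch of how the Gluster volume could be created. The hostnames node1..node5 and the brick path /data/brick1/gv0 are placeholders, and "replica 5" is just my reading of "a full copy on every node":

    # Rough sketch only - hostnames and brick paths are placeholders.
    # "replica 5" keeps a full copy of the data on every node.
    gluster peer probe node2
    gluster peer probe node3
    gluster peer probe node4
    gluster peer probe node5

    gluster volume create gv0 replica 5 \
        node1:/data/brick1/gv0 \
        node2:/data/brick1/gv0 \
        node3:/data/brick1/gv0 \
        node4:/data/brick1/gv0 \
        node5:/data/brick1/gv0
    gluster volume start gv0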

One of the decisions left to take is whether to use software RAID1 or hardware RAID1 + BBU.

Software RAID is the solution I'm very familiar with (I've been managing a number of servers for 15 years and I know how the tools work). I've never had a serious problem with it (mostly just failed HDDs). These are the reasons why I prefer software RAID.

What I dislike about hardware RAID is the incompatibility between controller vendors and my lack of experience with them: different configuration options, different monitoring methods, different utility programs - not a good feeling when building a cluster system.

I know that, with a BBU, hardware RAID can be both fast and reliable (the battery-backed write-back cache can safely acknowledge writes before they reach the disks). However, since all data will be stored in a highly redundant manner in the cluster, my idea is to use software RAID1 and disable barriers in the file system to increase write performance. I expect this to give performance similar to hardware RAID1. Of course, I risk data loss due to the volatile write cache, but IMHO that should be handled by the clustering mechanisms anyway (the whole machine should be able to restore its data from the other nodes after a failure).
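
Concretely, I'm thinking of something like this (assuming ext4 on the md device; the device name and mount point are placeholders):

    # Placeholder device and mount point; assumes ext4 on the software RAID1 array.
    # barrier=0 disables write barriers and trades single-node crash safety for speed.
    mount -o barrier=0 /dev/md1 /data

    # or permanently in /etc/fstab:
    # /dev/md1  /data  ext4  defaults,barrier=0  0  2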

I have no concerns about the CPU resources needed by a software RAID implementation.

Is my assumption correct, or am I missing some important detail that would help me make the right choice?

Udo G
  • With a modern CPU, there's probably not a lot in it. But don't forget - the times when your IO is 'hot' and using a lot of CPU cycles will also be the times when your cluster is trying to 'hog' as many CPU resources as it can. – Sobrique Jan 15 '15 at 09:53
  • The clustering mechanisms rely on write barriers to ensure that data has been written to any given node. They won't 'handle it anyway' – JamesRyan Jan 15 '15 at 10:30

1 Answer


I prefer software RAID to hardware RAID on single servers, because hardware RAID forces the admin to take precautions against hardware failure of the RAID controller. This usually requires stocking and regularly testing spare RAID controllers.

In your setup, though, I assume that redundancy is at the node level, not the disk level. If a node fails for any reason (CPU, power supply, RAID controller, etc.), that node will drop off the cluster and be replaced ASAP with a new or repaired node, and the data will then be rebuilt from the cluster, not from the RAID. Having said that, the question is whether you need a RAID at all!

You might say: "My database is mostly read, so a RAID 1 will approximately double the throughput, as reads can be distributed between both disks." But be aware that a disk failure, followed by replacement of that disk and a rebuild of the RAID, temporarily reduces the read rate on that node to single-disk level. If your database cannot share the traffic reasonably between unequal nodes by giving less traffic to the slow node, then the total load the database can handle drops to half of its normal value! That might force you to take a node with a failed disk completely off the database anyway, as long as it is busy with its internal RAID rebuild. But that renders the RAID mostly useless.

The alternative is to not use any RAID at all, but to let each node join the database twice, once for each disk. That puts more burden on the CPU, but if disk I/O is your limiting factor, then who cares about CPU time? And if a disk fails, that particular half-node goes offline and joins again once the disk has been replaced. So the load will be shared fairly across all disks.

If you have a high write load, the separate-disk solution will give you twice the write throughput of a RAID 1.

So basically, the only reason to still think about a BBU is if your latency requirements are so tight that you cannot wait for the data to physically reach the disks. In case of a power failure, the BBU ensures that the data is still written. But there are alternatives, namely SSD caching layers like dm-cache or bcache. In writeback mode, they write data to the SSD first, which is much faster than the write to disk, and acknowledge the write at that point. Even after a power failure, they will correctly read the blocks back from the SSD. dm-cache and bcache come with all recent Linux kernels, and a small (64 or 128 GB) server-grade (!!) SSD is still cheaper than a BBU RAID controller.
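
As a rough sketch of the bcache variant (dm-cache is configured differently; the device names here are placeholders, not taken from your setup):

    # Placeholders: /dev/md1 (or a plain disk) as backing device, /dev/sdc as the SSD cache.
    make-bcache -B /dev/md1       # register the slow backing device
    make-bcache -C /dev/sdc       # format the SSD as a cache device

    # attach the cache set (the UUID is printed by "make-bcache -C")
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

    # switch from the default write-through to write-back caching
    echo writeback > /sys/block/bcache0/bcache/cache_mode

    # the cached device shows up as /dev/bcache0 and can be formatted and mounted
    mkfs.ext4 /dev/bcache0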

Kai Petzke
  • Your separate-disk solution seems interesting. However, without RAID on a single node I must use one of the two as the system (boot) disk. If that one fails, the node won't boot up anymore. That would mean that on failure of the "main" disk, the node would have to be set up again, right? – Udo G Jan 15 '15 at 10:49
  • 1
    There are many solutions to the boot issue. For example, you can partition both disks to have a small root, a small swap and a large data partition. Then put a software raid 1 onto the two root partitions of the two disks and your node will happily continue to run, if one disk fails. – Kai Petzke Jan 15 '15 at 10:56
  • Of course, you're obviously right. Seems like a good solution to me. – Udo G Jan 15 '15 at 13:46
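
For reference, a minimal sketch of the root-partition mirror described in the comment above (the device names and the Debian-style mdadm.conf path are assumptions):

    # Assumptions: /dev/sda1 and /dev/sdb1 are the two small root partitions;
    # the large data partitions stay outside the RAID.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.ext4 /dev/md0

    # record the array so it is assembled at boot (Debian location;
    # on RHEL-style systems the file is /etc/mdadm.conf)
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf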