My thinking behind this is that RAID 1 creates two or more copies of the data on multiple EBS drives. Yet aren't Amazon EBS disks virtually fail-safe because they live on multiple physical drives? In terms of reliability, then, you aren't gaining much by adding RAID 1. Is this correct, or are my facts wrong? I realize you would probably still gain read performance from RAID 1.

2 Answers

Yes, EBS is fault tolerant on the back end, but EBS failures do occur, and in unexpected ways. What you don't see is the type of failure most of us are used to, where a drive goes bad and simply fails outright. The most frequent failure is a huge and unpredictable increase in latency, which can make your application unresponsive. With RAID 1 or RAID 10 sets, you can simply fail the problem drive out of the array and replace it with a new one with no downtime.
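As a rough sketch of what "failing the problem drive out" can look like, assuming the array is Linux software RAID managed with `mdadm`: the helper below builds the fail, remove, and add commands and only prints them unless told otherwise. The function names and the device paths in the usage note are hypothetical; substitute your own.

```python
import subprocess


def replace_member_cmds(array, bad_dev, new_dev):
    """Return the mdadm commands that swap bad_dev for new_dev in array."""
    return [
        # Mark the misbehaving member as faulty so md stops using it.
        ["mdadm", "--manage", array, "--fail", bad_dev],
        # Remove the faulty member from the array.
        ["mdadm", "--manage", array, "--remove", bad_dev],
        # Add the replacement volume; md resyncs onto it in the background.
        ["mdadm", "--manage", array, "--add", new_dev],
    ]


def replace_member(array, bad_dev, new_dev, dry_run=True):
    """Print the commands (dry run) or execute them in order (needs root)."""
    cmds = replace_member_cmds(array, bad_dev, new_dev)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds
```

For example, `replace_member("/dev/md0", "/dev/xvdf", "/dev/xvdg")` prints the three commands without touching the array. The replacement EBS volume has to be created and attached to the instance first.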

Recall the ec2pocolypse a couple of months ago, when a large percentage of EBS volumes became completely unresponsive. Those of us who had RAID 10 sets were able to recover easily by failing out a drive or force-detaching it with the API. Those who did not (I'm looking at you, Reddit) had to suffer through just shy of a week of downtime.

If you actually care about your data, you should never, ever, under any circumstances RAID0 it. By doing this, you increase your probability of failure while reducing your ability to recover from that failure. Snapshotting is great, but unless you stream your binary logs (for example), you cannot perform a point in time recovery. If you are in e-commerce, people get upset when they pay for something that doesn't end up getting shipped because there is no longer any record of it in the database.

I recently wrote about RAID10 EBS after experiencing yet another success from EBS RAID: http://blog.9minutesnooze.com/raid-10-ebs-data/

The question is: who do you trust more with your data, Amazon or yourself?

Aaron Brown
  • Also, doesn't the mirroring in RAID10 give up to double the read performance? Since reads will be distributed evenly to the underlying EBS volumes? – Nic Cottrell Jan 18 '13 at 13:16
  • I wonder if the huge latency periods are caused by disk failure where it fails over to your most recent S3 snapshot... http://stackoverflow.com/questions/13576363/does-taking-a-snapshot-of-an-ebs-volume-increase-reliability – rogerdpack Sep 09 '14 at 22:13
  • So just clarifying, you would recommend RAID 1 in the case of something like a drive for primarily MySQL data (or any better options)? (as well as snapshots etc) – Ian Oct 10 '15 at 11:51

Behind the abstraction the drives are already redundant. It is fine to run them in RAID 0 for speed. What is optimal is to use the snapshot functionality for backups. On RAID, this can be done by breaking down the RAID or freezing the volumes, snapshotting, then returning the drives to normal use. Alternatively, writing the data to a single EBS volume and snapshotting that can cover other issues as well, such as instance failure which may leave the RAID drives in an inconsistent state, even when reattached.
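A minimal sketch of the freeze, snapshot, thaw sequence described above, under a few assumptions: the filesystem supports `fsfreeze`, the script runs as root, and `ec2` is an object exposing `create_snapshot(VolumeId=..., Description=...)` the way a boto3 EC2 client does. The function name and its parameters are hypothetical.

```python
import subprocess


def snapshot_frozen(ec2, volume_ids, mountpoint, run=subprocess.run):
    """Freeze the filesystem, start a snapshot of each volume, then thaw.

    ec2 is assumed to behave like boto3.client("ec2"); run is injectable
    so the sequence can be exercised without root or AWS access.
    """
    snapshot_ids = []
    # Block new writes so every member volume is captured in the same state.
    run(["fsfreeze", "--freeze", mountpoint], check=True)
    try:
        for volume_id in volume_ids:
            response = ec2.create_snapshot(
                VolumeId=volume_id,
                Description=f"member of array mounted at {mountpoint}",
            )
            snapshot_ids.append(response["SnapshotId"])
    finally:
        # Always thaw, even if a snapshot call raised.
        run(["fsfreeze", "--unfreeze", mountpoint], check=True)
    return snapshot_ids
```

With real credentials this would be called as something like `snapshot_frozen(boto3.client("ec2"), ["vol-...", "vol-..."], "/data")`. The thaw sits in a `finally` block because a filesystem left frozen after an API error would stall the application, which is the outcome the backup was meant to protect against.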

TL;DR: Using RAID 1 is overkill; it is better to prepare for other failure scenarios with robust backups.

Flashman
  • Thanks for the answer. I forgot to update this with the results of my research. Basically, what I was looking for is that yes, there is indeed a 0.5-1% annual failure rate for these EBS drives, so you cannot rely on the redundancy that Amazon provides. Yet you also cannot rely on RAID 1, because the failures of two EBS drives are much less independent of each other than in a typical dedicated server. As you mentioned, the best solution (besides regular backups) is to make frequent snapshots, as explained here: https://forums.aws.amazon.com/thread.jspa?messageID=124224 – Sameer Parwani Apr 06 '11 at 04:09
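To put rough numbers on the thread's argument, here is a back-of-the-envelope sketch using the 0.5-1% annual failure rate quoted in the comment above. It assumes volume failures are independent and ignores rebuild time. As the same comment points out, EBS failures are correlated, so the RAID 1 figure is optimistic. The function names are hypothetical.

```python
def raid0_loss(p, n):
    """Chance a stripe of n volumes loses data in a year: any one failure is fatal."""
    return 1 - (1 - p) ** n


def raid1_loss(p, n):
    """Chance a mirror of n volumes loses data in a year: all copies must fail.

    Assumes independent failures and no rebuild, which EBS does not guarantee.
    """
    return p ** n


for p in (0.005, 0.01):
    print(
        f"annual failure rate {p:.1%}: "
        f"single volume {p:.4%}, "
        f"2-volume RAID 0 {raid0_loss(p, 2):.4%}, "
        f"2-volume RAID 1 {raid1_loss(p, 2):.4%}"
    )
```

Under these assumptions, striping two volumes roughly doubles the chance of data loss, while mirroring cuts it by orders of magnitude. Correlated failures shrink the mirroring benefit but do nothing to help the stripe.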