
Some AWS instances have "ephemeral disks" attached, which are much faster than EBS. But ephemeral disks will be blank and uninitialised when your instance is stopped and started. The data on disk generally survives an instance reboot though.

Question: Should I use a software RAID1 on my AWS instance, built over an ephemeral disk and an EBS volume?

My thinking is that the RAID1 will come up in degraded mode with only the EBS volume, and then we can use mdadm commands to add the blank ephemeral disk back into the array. This will allow the app to start up 5-10 minutes sooner, at the cost of worse performance while the RAID1 resynchronises.
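A minimal sketch of that recovery step, assuming the array is `/dev/md1` and the blank ephemeral disk comes back as `/dev/nvme0n1` (device names are illustrative):

```shell
# The array assembles degraded on the EBS member alone; start it anyway
# so the app can come up immediately.
mdadm --run /dev/md1

# Re-introduce the blank ephemeral disk; md treats it as a new member
# and resyncs it from the EBS copy in the background.
mdadm --manage /dev/md1 --add /dev/nvme0n1

# Watch the rebuild progress.
cat /proc/mdstat
```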

Background: I have an app that uses ~40 GB of data files. Access times correlate directly with performance, so the faster the disk, the faster the app runs.

Historically we've run a script from rc.local to rsync data from an EBS disk to the ephemeral disk, and then started the software. The sync takes 5-10 minutes, better than the 5-20 minutes it took to sync from another instance. In the past we've even served the data files from a ramdisk, which was not as fast as the ephemeral disks.
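That rc.local approach looks roughly like this (filesystem, paths, and the service name are illustrative, not our actual script):

```shell
#!/bin/sh
# Populate the blank ephemeral disk from the EBS copy, then start the app.
mkfs.ext4 -q /dev/nvme0n1                # ephemeral disk is blank after a stop/start
mount /dev/nvme0n1 /mnt/ephemeral
rsync -a /mnt/ebs-data/ /mnt/ephemeral/  # the 5-10 minute sync
systemctl start myapp                    # hypothetical service name
```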


More info - this is an i3.4xlarge so it has 2x NVME ephemeral drives.

# hdparm -t /dev/md? /dev/nvme?n1 /dev/xvd?
/dev/md0:     9510 MB in  3.00 seconds = 3169.78 MB/sec RAID0 of two eph drives
/dev/nvme0n1: 4008 MB in  3.00 seconds = 1335.74 MB/sec Eph drive
/dev/nvme1n1: 4014 MB in  3.00 seconds = 1337.48 MB/sec Eph drive
/dev/xvda:     524 MB in  3.01 seconds = 174.17 MB/sec  gp2 16GB, 100 IOPs root
/dev/xvdf:     524 MB in  3.01 seconds = 174.23 MB/sec  gp2 120GB, 300 IOPs data
/dev/xvdz:     874 MB in  3.01 seconds = 290.68 MB/sec  gp2 500GB, 1500 IOPs raid-seed disk

I have created a raid1 with

mdadm  --create /dev/md1 --raid-devices=3 --verbose --level=1 /dev/nvme?n1 /dev/xvdz

which returns:

$ cat /proc/mdstat
Personalities : [raid0] [raid1]
md1 : active raid1 nvme1n1[4] nvme0n1[3] xvdz[2]
      524155904 blocks super 1.2 [3/3] [UUU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

Curiously, the raid reads about as fast as the faster drives, and is not limited to the speed of the slowest disk.

/dev/md1:     4042 MB in  3.00 seconds = 1346.67 MB/sec
/dev/nvme0n1: 4104 MB in  3.00 seconds = 1367.62 MB/sec
/dev/nvme1n1: 4030 MB in  3.00 seconds = 1342.93 MB/sec
/dev/xvdz:     668 MB in  3.01 seconds = 222.26 MB/sec

A power-off/on returns a degraded RAID set, but the app can still run, albeit slower. The cost of waiting 5-10 minutes is avoided, and I can re-add the ephemeral disks to the array on the fly without an app restart.
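The on-the-fly re-add can be scripted like this sketch (device names as in the hdparm output above; the speed cap is an optional assumption, not something I've tuned):

```shell
# After a cold start the NVMe members are blank, so --add (not --re-add)
# is needed; the array rebuilds while the app keeps running off EBS.
for dev in /dev/nvme0n1 /dev/nvme1n1; do
    mdadm --manage /dev/md1 --add "$dev"
done

# Optionally cap resync speed (KB/s) so the app isn't starved during rebuild.
echo 50000 > /proc/sys/dev/raid/speed_limit_max
```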

So while it seems to work perfectly, is there anything I've missed or not considered?

Criggie
  • How often does the data on your disk change? What's the RTO and RPO? Interesting idea to span RAID across them, but it seems a bit "hacky" and I wonder if there's a better solution. EFS with some kind of disk cache perhaps, a script to populate the ephemeral disk from EBS, that kind of thing – Tim Dec 17 '18 at 00:29
  • @tim the data files are updated once every 3 months, and are read-only to the application. Recovery time isn't particularly important, the host is redundant; however the app just runs slower as disks get slower, which is why fastest-disk-possible is important. We already have a hacky script to populate from EBS. I'm spinning this up at the moment and will provide some timing tests soon. – Criggie Dec 17 '18 at 00:48
  • Further, with a RAID0 of the two eph disks, the application's performance metric is ~1345 ms. With a RAID1 of two eph disks plus an EBS disk, it gets ~914 ms, so it's running better than before. – Criggie Dec 17 '18 at 02:56
  • In a RAID1 it might speed up your read load, but not your write load. An alternative would be a read cache (blockcache/lvmcache). That's probably better for controlling policy and has no rebuild penalty. – eckes Dec 17 '18 at 04:05
  • Instance types have a maximum EBS throughput. In the case of the i3.4xlarge it's 16,000 IOPS / 437.5 MB/s [(link)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html) So you're only currently getting half of your available EBS bandwidth, possibly due to your EBS disk settings – Zac Faragher Dec 17 '18 at 06:03
  • If you don't want to go all out on EBS IOPS, you could look at provisioning several lower IOPS disks and RAIDing them together as well, up to your Instance-max IOPS – Zac Faragher Dec 17 '18 at 06:12
  • Given you have redundant servers I think a script to copy from EBS on start-up is a good solution. You could alternately create an AMI, which lazy loads from S3, though that is probably slower, you can use dd to force all blocks to be read. Reading from S3 using multiple threads (5? 50? Try it and see) to an ephemeral disk using an S3 gateway should be pretty quick, probably quicker than EBS. EFS is also much higher throughput than EBS, but could be limited to a single instance. – Tim Dec 17 '18 at 06:41
  • @eckes Good point - in this case there is no write load during normal operation. The only time the files are written is during an update, when the host is out of the cluster for 10 minutes every 3 months. – Criggie Dec 17 '18 at 10:37
  • @ZacFaragher sadly I can't find AWS's IOPs graphs for the Ephemeral disks, but the EBS disk in the raid shows under 1 read IOPs and flatline 0 write IOPs. – Criggie Dec 17 '18 at 10:40
  • I haven't tried this, so not an answer, but have you considered using a caching filesystem that stores data on the ephemeral drives but retrieves from elsewhere? You'd still see short-term performance degradation, although the nature of the degradation would be different (first-touch versus background rebuild). But it seems like a simpler solution. – kdgregory Dec 17 '18 at 11:56

3 Answers


Hmm, I'm not sure I would want to mix two such different volumes in a single RAID1. If you do, roughly half of your reads will be served from the slower EBS volume and half from the faster instance storage, which may lead to quite unpredictable performance. I would look at standard tools to achieve better performance.

Look at Provisioned IOPS EBS volumes (if you need high random-access IO) or Throughput Optimised EBS (if you're sequentially reading large files). They may provide the performance you need out of the box.

You should also look at some caching, especially as it's mostly read-only content as you say. Every time a file is needed, check the local cache dir on the ephemeral storage, and if it's there, serve it from there. If not, fetch it from EBS and save a copy in the cache. Since it's all read-only, it should be quite a simple caching layer.
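A minimal sketch of such a read-through cache, assuming an EBS-backed source directory and an ephemeral cache directory (the `cache_fetch` helper and both paths are hypothetical):

```shell
# Read-through cache: serve a file from the fast ephemeral cache dir if
# present, otherwise copy it from the EBS-backed source first.
cache_fetch() {
    src_dir="$1"; cache_dir="$2"; file="$3"
    if [ ! -f "$cache_dir/$file" ]; then
        mkdir -p "$cache_dir"
        cp "$src_dir/$file" "$cache_dir/$file"   # first touch: pay the EBS cost once
    fi
    printf '%s\n' "$cache_dir/$file"             # later reads hit fast storage
}

# Usage: path=$(cache_fetch /mnt/ebs-data /mnt/ephemeral/cache datafile.bin)
```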

Or if the files on EBS are database files (which I suspect may be the case) cache the results of your queries or processing in Memcache or Redis or in the database native cache (e.g. MySQL Query Cache).

Hope that helps :)

MLu
  • That's what I'm testing - could be that the first read satisfies the request, so all the reads will be served from the faster disk. – Criggie Dec 17 '18 at 01:58
  • Also, an IO1 disk with maximum IOPS is not as fast as an ephemeral disk for this particular use case, plus IO1 is quite pricey at that level. – Criggie Dec 17 '18 at 01:59
  • The problem of half the requests being served by the slower media should be possible to fix using `--write-mostly`. – kasperd Dec 17 '18 at 09:50

40 GB is small enough for a RAM disk, which will be faster than scratch disks. How long will your app run, and is it worth paying for an instance with a larger memory allocation?

24x7 may be too costly. But 40GB is within reach.

As a bonus you should enjoy more cores.

I agree with Query Caching for deterministic queries, and any sort of buffering will help over time.
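For reference, a RAM disk of that size is a one-liner with tmpfs (the size and mount point are illustrative):

```shell
# Back the data directory with RAM; contents are lost on reboot, so it
# still needs repopulating on start, just like the ephemeral disks.
mount -t tmpfs -o size=48G tmpfs /mnt/ramdisk
```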

mckenzm
  • Back when this app was running on physical servers, a ramdisk was indeed the fastest solution. Now it's in AWS, the ephemeral disk is faster than a ramdisk **for this application**, which is what led me to trying to get the speed of eph disks while safely working around the "blank on cold-start" nature. An i3.4xlarge has 128 GB RAM, and the app uses about 2/3 of that now. It's an in-house app so it's quite custom (read that as you will :)) – Criggie Dec 17 '18 at 10:44
  • Crikey, of course you would want a warm start, copying images being faster than a file system build. A shame they can't just be tagged off to a "nano" instance to keep them alive. – mckenzm Dec 17 '18 at 21:57

I... wouldn't use a RAID1 volume, even with --write-mostly. The performance degradation while the set rebuilds is going to get annoying.

What I would recommend looking into instead is bcache. I've found it to be very useful in situations where I've got access to SSDs, but also have a very large amount of data to store (usually very large PostgreSQL databases) for which it isn't cost-effective to purchase all SSDs. I've only used it in "persistent" mode, where it uses the SSDs as a write-back cache, but it does have a mode where the cache storage layer is treated as ephemeral, and no writes are considered complete until they're on the underlying permanent storage.
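A sketch of a bcache setup along those lines, using an ephemeral NVMe disk as the cache device and the EBS volume as backing storage (device names as in the question; the cache-set UUID placeholder must come from `bcache-super-show`):

```shell
# One-time setup: format the backing (EBS) and cache (ephemeral) devices.
make-bcache -B /dev/xvdz          # backing device, survives stop/start
make-bcache -C /dev/nvme0n1       # cache device, blank after stop/start

# Attach the cache set to the backing device.
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# writethrough keeps EBS authoritative, which is safe when the cache
# device is ephemeral and may vanish on a stop/start.
echo writethrough > /sys/block/bcache0/bcache/cache_mode
```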

womble