2

I have heard it recommended to stay away from AWS hosting for certain "big data" applications (e.g. Hadoop, Cassandra, Solr) because EC2 instances typically use network-attached storage (there are now some high-I/O instances, but they are apparently pretty expensive).

It makes sense to me that NAS would entail a pretty decent performance hit, but how much? Since AWS exists, there are presumably plenty of applications that make sense in this type of environment, but what is a good rule of thumb for determining whether a particular application is a good candidate for AWS and NAS? (Besides sticking it on AWS and trying it out.)

John Berryman

2 Answers

5

Storage latency will be your metric.

If your application is highly sensitive to storage latency, you'll want to steer clear of AWS and go physical, or drop the money to get the Storage Optimized instances. AWS specifically states those are the types for things like Hadoop and Cassandra.

The thing about the higher tiers of AWS instance types is that the storage isn't NAS; it's more like NAS-backed physical. The details aren't clear, but you're much closer to the hardware when you're driving a storage-optimized or a cluster-optimized instance.
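As a rough way to reason about latency sensitivity, here is a minimal sketch that models one storage request as a fixed latency plus streaming time, and reports how much of the request is spent waiting on latency. The function names and every number in it (10 ms latency, 100 MiB/s throughput, the two request sizes) are assumptions picked for illustration, not measurements of EBS or any instance type.

```python
def transfer_time(size_bytes, latency_s, throughput_bytes_per_s):
    """Time for one request: fixed latency plus time to stream the data."""
    return latency_s + size_bytes / throughput_bytes_per_s


def latency_share(size_bytes, latency_s, throughput_bytes_per_s):
    """Fraction of the total request time spent waiting on latency."""
    total = transfer_time(size_bytes, latency_s, throughput_bytes_per_s)
    return latency_s / total


# Hypothetical figures, chosen only for illustration (not measurements).
LATENCY_S = 0.010              # assumed 10 ms per request
THROUGHPUT = 100 * 1024**2     # assumed 100 MiB/s sustained

SMALL_READ = 4 * 1024          # 4 KiB random read (database-style lookup)
LARGE_READ = 64 * 1024**2      # 64 MiB sequential read (large streamed block)

for label, size in [("4 KiB read", SMALL_READ), ("64 MiB read", LARGE_READ)]:
    share = latency_share(size, LATENCY_S, THROUGHPUT)
    print(f"{label}: {share:.1%} of the request time is latency")
```

Under these assumed numbers, latency accounts for almost all of the small read and only a small fraction of the large one. That is the rule of thumb in miniature: workloads made of many small random reads feel storage latency, while workloads that stream large chunks mostly feel throughput. Plug in figures measured on your own setup before drawing conclusions.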

sysadmin1138
  • Hmm... interesting. So *would* Hadoop be OK on NAS? I don't really care whether it takes 8 milliseconds or 8 seconds to start streaming data, because it will be streamed in such big chunks that the initial latency will not matter. – John Berryman Aug 15 '13 at 02:03
1

I am running a Cassandra cluster on AWS, and I would agree with what you have read about staying away from NAS (EBS). I recently moved to hi1.4xlarge boxes (they come with two 1 TB SSDs), which I put in RAID 0 to get the most out of them. With this setup I can easily handle 15k reads/sec. My app is not very write-oriented, so I can't help you there. Hope this helps.

APZ