How do you automate failover on EC2?

Question

Of the folks managing their own clusters (i.e. not using/paying for Amazon Autoscale, Rightscale, Scalr, etc.), how are you managing your instances on EC2 and handling (e.g.) failover? I'm wondering if most folks just end up writing their own boatloads of scripts against the EC2 API, as I suspect.

That's certainly our approach: whip up our own Python Boto-based monitoring/restarting daemon that runs off-site, listening for UDP keep-alives from our instances. On failure, we snapshot volumes, register images, start new instances, delete old volumes, and so on.
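For readers who want a picture of what such a daemon looks like, here is a minimal sketch of the keep-alive side only. It is not our actual code: the class and function names, the port, the 30-second timeout, and the assumption that each datagram carries just the sender's instance id are all made up for illustration. The recovery work (snapshots, image registration, new instances) is left to a caller-supplied `on_failure` callback.

```python
import socket
import time


class HeartbeatMonitor:
    """Tracks the last keep-alive seen from each instance (hypothetical helper)."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # instance id -> timestamp of last keep-alive

    def record(self, instance_id, now=None):
        # Remember when this instance last checked in.
        self.last_seen[instance_id] = time.time() if now is None else now

    def failed_instances(self, now=None):
        # An instance counts as failed once it has been silent past the timeout.
        now = time.time() if now is None else now
        return sorted(
            instance_id
            for instance_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

    def forget(self, instance_id):
        # Stop tracking an instance, e.g. once recovery has been started.
        self.last_seen.pop(instance_id, None)


def serve(monitor, on_failure, host="0.0.0.0", port=9999, poll_seconds=1.0):
    """Receive keep-alive datagrams forever and report instances that go silent.

    Assumption: each datagram contains only the sender's instance id as UTF-8.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(poll_seconds)
    while True:
        try:
            payload, _addr = sock.recvfrom(1024)
            monitor.record(payload.decode("utf-8", errors="replace").strip())
        except socket.timeout:
            pass  # no datagram this interval; still check for timeouts below
        for instance_id in monitor.failed_instances():
            monitor.forget(instance_id)
            on_failure(instance_id)  # caller decides how to recover
```

Calling `serve(HeartbeatMonitor(), on_failure=print)` would simply print the id of any instance that stops sending keep-alives; a real daemon would put its EC2 API calls in that callback.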

Every so often, when hacking on our scripts, I think there must be some open-source tools out there that deal with these issues already, and which don't have the constraints of (say) Scalr, but I always come back from Google empty-handed. (Things like Scalr are pretty limited in the set, versions, and configurations of software they support, and have specialized and IMO cumbersome ways of manipulating these setups.)

Also, the Linux-HA/Pacemaker ecosystem (Heartbeat, ldirectord, etc.) sounds like it isn't really suited for EC2. (But then I found this - though I'm not sure this is really a high-quality solution).

score 5 · Answer 1 · 2010-12-06T09:32:53.817

Well, I don't mean to just state the obvious, but the general idea is to push this complexity into the services managed by Amazon.

So on the front end, you would use Amazon Elastic Load Balancing (ELB) to provide highly available load balancing. On the back end, you use Amazon Relational Database Service (hosted MySQL), SimpleDB, and S3 for storage. All of these are managed by Amazon and include some form of high availability / failover handling.

This typically leaves your web application servers, and any less common server types you might be using (rendering servers, self-installed NoSQL data stores, etc.).

Webapp servers are usually handled well enough with the health checks built into ELB. You can accept a small performance degradation when one webapp server is down, or consistently provision one server more than you need. Or, if your config is simple, ELB together with CloudWatch can automatically spawn a new webapp server for you when one fails.
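As a rough illustration of the webapp side of an ELB health check, here is a minimal sketch of an endpoint the load balancer could poll. The `/health` path, port 8080, and the `check_app` probe are assumptions made up for this example; the only thing relied on is that an HTTP health check treats a 200 response as healthy.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_app():
    """Hypothetical readiness probe; replace with real dependency checks."""
    return True


def health_response(healthy):
    """Map a health flag to the (status, body) pair sent to the load balancer."""
    return (200, b"OK\n") if healthy else (503, b"UNHEALTHY\n")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(check_app())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent health-check polls out of the log


def main(port=8080):
    # Blocks forever; point the load balancer's health check at /health.
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Returning 503 when `check_app` fails is what lets the load balancer take the instance out of rotation without anyone having to intervene.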

Your own custom servers are another matter. For these it's true, you're on your own, and will need to make do with application built-in methods, or duct tape together something with custom scripts / open source HA tools.

Buying Rightscale's solution might be too expensive. But less expensive Amazon tools such as ELB, basic CloudWatch alerting (now free at 5-minute resolution), or AutoScale are well worth it if you need high availability.

We're familiar with the AWS feature set, as well as their limitations. To take your first example, ELB is accessed via CNAME RRs, which can't coexist with SOA RRs, and thus can't serve TLDs, plus can't be accessed via static IPs - frustrations widely echoed in the forums. To take your second example, yep, RDS is MySQL, which is the giant limitation. Yes, we're interested in automating failover of our own machine types. Yes, private cloud deployment is relevant to us. Yes, I'm just curious. Etc. — Yang, Dec 06 '10 at 07:59
@Yang: You should have phrased your question more carefully, and saved me the trouble of typing up my answer. There is no one-size-fits-all solution to HA; it depends on the service in question, how state is kept, protocol failover properties, etc. You're right about the limitations/difficulties in using typical IP level HA tools on EC2. But there is no single answer that applies universally to "HA on AWS". — , Dec 06 '10 at 09:40

score 0 · Answer 2 · answered Jul 24 '12 at 18:26

RightScale has some great articles on how to automate failover on EC2. While most of them show you how to do it using RightScale itself, the principles are general and probably helpful to anyone thinking of how to set up a failover architecture on EC2.

score 0 · Answer 3 · edited May 23 '17 at 12:41

The issues you describe (HA, monitoring custom servers, 'duct-taping' services) are generally handled by a PaaS provider. Rightscale and Scalr were already mentioned in a previous answer, and there are additional good options (see here for some PaaS options: https://stackoverflow.com/questions/9542784/looking-for-paas-providers-recommendations).

You should consider which of the providers gives the closest fit to what you need.

Disclosure: I work for Cloudify, an open-source PaaS provider.

score 0 · Answer 4 · answered Mar 04 '14 at 18:26

I recently wrote a post on our engineering blog about how to use ELB in conjunction with Auto Scaling to achieve automatic failover for any kind of app. It covers how ELB health checks can be used to ping the status of your app and trigger auto scaling actions.

score 0 · Answer 5 · answered Jul 31 '14 at 12:38

You install heartbeat on both servers. You attach an Elastic IP to the 'active' server. You configure a script to do the failover by initiating an API request to obtain the Elastic IP. As soon as the 'stand-by' server has the Elastic IP (takes about 30-60 seconds), it can be the master/active.

I don't have the specifics to provide here.
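For what it's worth, the takeover step above can be sketched roughly as follows. This is an illustration under assumptions, not a tested failover script: it assumes a boto3 EC2 client and a VPC Elastic IP identified by its allocation id (the function name and the ids in the usage note are made up). Heartbeat would be configured to run something like this on the stand-by when it decides the active server is gone.

```python
def take_over_elastic_ip(ec2_client, allocation_id, standby_instance_id):
    """Move the Elastic IP to the stand-by, even if it is still attached elsewhere.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    response = ec2_client.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        AllowReassociation=True,  # steal the address from the failed server
    )
    return response.get("AssociationId")
```

With boto3 installed and credentials configured, usage would look like `take_over_elastic_ip(boto3.client("ec2"), "eipalloc-...", "i-...")`, with your own allocation and instance ids filled in.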

score -1 · Answer 6 · answered Dec 06 '10 at 01:42

Amazon already provides Elastic Load Balancing... Why reinvent the wheel?

Chris S

Because of ELB's various limitations? Because it requires CNAME and can't serve both foo.com and www.foo.com? Because I want to implement custom scheduling logic? Because I'm just curious how you'd implement ELB yourself in a cluster of unreliable VMs? Take your pick. – Yang Dec 06 '10 at 07:40
@Yang, you do it the same way you would if they were servers in your datacenter. There's no fundamental difference, no magic sauce that makes it a cloud environment. – Chris S Dec 06 '10 at 13:35