
I'm working on a DR setup and runbook for our AWS environment.

I have no experience with building DR setups, so I'd really appreciate guidance from anyone who has done this before.

Our Setup:

RDS Aurora MySQL DB
ElastiCache
Ubuntu 16.04 Linux EC2 instances
Static files stored in S3
Route 53: 250 record sets in total
Application Load Balancer

Everything is in the same VPC. We're trying to build a Pilot Light DR setup.

Axel

1 Answer

It depends on what you're trying to achieve and what kind of Disaster (that's the D in DR) you're trying to protect against. The most likely D is an Instance Failure (which includes EC2, ElastiCache node, RDS node, etc). Every other Disaster is quite rare.

Therefore in most cases it's enough to simply make your setup Multi-AZ with proper automatic fail-over and you're done. More specifically:

  • Aurora - make it Multi-AZ with at least 2 nodes. You can make a replica in a different Region for peace of mind.
  • ElastiCache - make it Multi-node across AZs. ElastiCache usually doesn't hold precious data, it's a cache after all.
  • S3 - enable versioning and possibly bucket replication to a different Region.
  • Route53 - don't worry, that's already global and not regional.
  • ALB - that's already Multi-AZ by default.
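As a small illustration of the S3 item, here is a minimal sketch that builds the request parameters for enabling versioning. The bucket name is a made-up placeholder, and the helper `versioning_request` is hypothetical; only the parameter shape follows the S3 PutBucketVersioning API.

```python
def versioning_request(bucket: str) -> dict:
    """Build keyword arguments for the S3 PutBucketVersioning API call."""
    return {
        "Bucket": bucket,
        "VersioningConfiguration": {"Status": "Enabled"},
    }


# "example-static-assets" is a hypothetical bucket name.
params = versioning_request("example-static-assets")
print(params)
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("s3").put_bucket_versioning(**params)`. Cross-Region replication additionally requires versioning on both the source and the destination bucket.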

What's left are the EC2 instances. You should have them in Auto Scaling groups (ASGs) spanning multiple Availability Zones, so that if one instance fails it is automatically recreated elsewhere. Needless to say, this requires stateless instances, i.e. all your data should reside in the database or on a shared filesystem like EFS, not on the EC2 instances themselves. Only then can you effectively put them in an ASG.
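A rough sketch of what such a multi-AZ ASG definition could look like, expressed as the parameters for the Auto Scaling CreateAutoScalingGroup API. The group name, launch template name, and subnet IDs are invented placeholders, the helper `asg_request` is hypothetical, and the code assumes (but cannot verify) that the subnets live in different Availability Zones.

```python
def asg_request(name, launch_template, subnet_ids, min_size=2, max_size=4):
    """Build keyword arguments for the CreateAutoScalingGroup API call."""
    # One subnet means one AZ, which defeats the purpose of the ASG here.
    if len(set(subnet_ids)) < 2:
        raise ValueError("need at least two subnets, each in a different AZ")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        # The API expects the subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Use the load balancer's health checks, not just EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }


# All names and IDs below are hypothetical placeholders.
asg = asg_request("web-asg", "web-template", ["subnet-aaa111", "subnet-bbb222"])
print(asg["VPCZoneIdentifier"])
```

With boto3, the dictionary could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**asg)`.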

If that's too hard, you can set up a CloudWatch alarm to automatically recover a failed instance - it usually works pretty well too.
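For reference, a minimal sketch of such a recovery alarm, written as the parameters for the CloudWatch PutMetricAlarm API. The instance ID, alarm name, and the period and threshold values are illustrative assumptions, and `recover_alarm_request` is a hypothetical helper; the metric `StatusCheckFailed_System` and the `ec2:recover` action ARN are the standard ones for EC2 instance recovery.

```python
def recover_alarm_request(instance_id: str, region: str) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API call."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        # System status check failures indicate a problem with the host.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per evaluation window (assumed)
        "EvaluationPeriods": 2,  # consecutive failing windows (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action that recovers the instance on healthy hardware.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# The instance ID is a hypothetical placeholder.
alarm = recover_alarm_request("i-0123456789abcdef0", "us-east-1")
print(alarm["AlarmActions"][0])
```

With boto3, this could be applied as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Note that the recover action is only supported on certain instance types and configurations.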

Alternatively, convert your apps to Docker containers and run them on Fargate, which again provides automatic recovery when a container fails.

The bottom line is: when a deployment is properly built in a cloud-native way, there is almost no reason for traditional manual DR, since high availability and fault tolerance are inherently built into the deployment.

Hope that helps :)

MLu
  • Thanks for such a detailed post. Sorry for not mentioning the type of disaster. We already have Multi-AZ for RDS as well as ElastiCache. For the EC2 instances, we have an Auto Scaling group across 2 AZs. I'm working on a DR setup in case a whole AWS region fails for some reason. Currently, our setup is based in Northern Virginia, and I have to create a DR setup in a different region. – Axel Jan 06 '20 at 09:54
  • @Axel Sure, regions may fail, but it's very rare, and when it happens it's usually fixed quickly, often faster than you can execute your DR runbook. If you are worried about a region failure, forget manual DR runbooks and look at **multi-region active-active architecture**. That's the proper way to do high availability in the cloud. – MLu Jan 06 '20 at 11:31