Running in One AWS Region is a Design Choice!

AWS’s recent outage to DynamoDB and related services in the US East region is a good reminder of some fundamentals of designing for availability on AWS.

If you need a primer on Availability Zones (AZs) vs Regions, check this out. The distinction matters most for services like EC2 and RDS, where each resource runs as a single virtual machine in one AZ. Because AZs are isolated from one another, it is very unlikely that virtual machines in separate AZs will go down at the same time, outside of a major natural disaster in the region. But many AWS services, like S3 and DynamoDB, are multi-AZ by default. This is great, but it also means that they have more cross-AZ dependencies than one might first assume, which makes them more susceptible to full-region outages.

This DynamoDB outage was the first one in about 4 years… so this is not a common occurrence. Still, an outage is an outage. But if you had designed your use of DynamoDB for region-level redundancy, you could have survived this incident with zero downtime.

Happily, while region-level redundancy is not a trivial task, it really isn’t that bad, at least compared with how hard and expensive it is to build something comparable in a legacy data center world.

So where to begin? Here’s an initial checklist for thinking about region-level redundancy:

AWS’s DNS service Route 53 (along with other DNS services) allows failover, load balanced, and latency-based DNS record sets. These are a key building block of a redundant configuration. Route 53 can automatically monitor the health of your endpoints and automatically switch the DNS from one region to another if the first region becomes unhealthy.
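
As a rough illustration of the failover record sets described above, here is a minimal Python sketch that builds a primary/secondary pair in the shape Route 53's ChangeResourceRecordSets API expects. The hostnames, health check ID, and function names are hypothetical placeholders, and the boto3 call is only made if you invoke apply_change_batch yourself.

```python
def failover_change_batch(name, primary_target, secondary_target,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with a PRIMARY/SECONDARY failover pair.

    All argument values are supplied by the caller; nothing here is
    specific to any real account or zone.
    """
    def record(role, target, health_check_id=None):
        rrset = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # keep this low so clients pick up a failover quickly
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            # Route 53 serves the secondary when this health check fails.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Region-level failover pair",
        "Changes": [
            record("PRIMARY", primary_target, primary_health_check_id),
            record("SECONDARY", secondary_target),
        ],
    }


def apply_change_batch(hosted_zone_id, batch, client=None):
    """Send the batch to Route 53. Requires boto3 and AWS credentials."""
    if client is None:
        import boto3  # imported lazily so the builder works without boto3
        client = boto3.client("route53")
    return client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=batch
    )
```

Keeping the builder separate from the API call makes it easy to review or unit test the records before anything touches your live DNS.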

If you can, use CloudFormation and a configuration management system (e.g. Ansible, Salt, or Dockerfiles) to replicate your stack in two regions. This is the best way to replicate your environment and application tiers.

If you cannot use those kinds of tools to automate, the next best option is to automate the copy of EBS snapshots or AMIs across regions using cross-region copy. You can do this with AWS’s API and script your own regular job, but AWS does not offer a simpler interface. If you’re looking for a simpler way to schedule snapshots and cross-region copies, check out Cloud Protection Manager.
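
If you do script this yourself, the core of a cross-region snapshot copy might look something like the following Python sketch. It assumes boto3 and uses made-up snapshot IDs and regions; the copy is requested from an EC2 client in the destination region. Scheduling, retention, and error handling are left out.

```python
def copy_snapshots_to_region(snapshot_ids, source_region, dest_ec2_client):
    """Copy each EBS snapshot into the region dest_ec2_client points at.

    Returns a dict mapping each source snapshot ID to the new snapshot ID.
    """
    copied = {}
    for snapshot_id in snapshot_ids:
        response = dest_ec2_client.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"Copy of {snapshot_id} from {source_region}",
        )
        copied[snapshot_id] = response["SnapshotId"]
    return copied


def main():
    """Example wiring. Requires boto3 and AWS credentials; IDs are placeholders."""
    import boto3
    dest = boto3.client("ec2", region_name="us-west-2")
    print(copy_snapshots_to_region(["snap-0123456789abcdef0"], "us-east-1", dest))
```

A scheduled job (cron, for example) could call main() once you have replaced the placeholder snapshot ID with a lookup of your most recent snapshots.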

Databases are the hardest, of course, but far, far easier than in the pre-AWS world (i.e. the dark ages).

MySQL running in AWS’s RDS service can be configured for cross-region replication with a few clicks. This asynchronously replicates your database in a second region. With another few clicks or API calls, you can promote that replica to master.
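
For those who prefer API calls to clicks, the two operations could be sketched in Python roughly as follows. This assumes boto3 and an RDS client created in the destination region; the ARN, region, and identifiers are whatever you pass in, and the function names are just illustrative.

```python
def create_cross_region_replica(dest_rds_client, source_db_arn,
                                source_region, replica_id):
    """Ask RDS (in the destination region) to create an asynchronous replica.

    For a cross-region replica the source is identified by its full ARN.
    """
    return dest_rds_client.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_db_arn,
        SourceRegion=source_region,
    )


def promote_replica(dest_rds_client, replica_id):
    """Promote the replica to a standalone, writable instance.

    This breaks replication, so treat it as a one-way step during failover.
    """
    return dest_rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
```

In real use you would build the client with something like boto3.client("rds", region_name="us-west-2") and repoint your application at the promoted instance's endpoint afterwards.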

If you do not need near-perfect data retention in a failover scenario, a cheaper option is cross-region RDS snapshot copies. Similar to AMI copy above, you can write your own script or use something like CPM to simplify it.

If you are doing your own replication with services running on EC2 (rather than using one of the AWS built-in features mentioned above), you will need to figure out how to keep your data private when it is moving between regions. Remember that AWS VPCs (think of a VPC as your LAN) can only exist in a single region… so you essentially need to connect two LANs to keep your data on your own network. You could create your own tunnel between EC2 instances. Trek10’s preferred method for a more robust connection is to use a virtual networking appliance running in EC2 from Cohesive Networks. AWS does have a feature called VPC Peering, but it currently only allows you to connect VPCs in the same region. AWS has stated that they plan to add cross-region support to it in the future.

Whether or not you want to invest the time to automate the failover process really depends on your Recovery Time Objective (RTO). If you need an RTO of under 1-2 hours, you really should automate. Between promoting your database to master, updating DNS, and any other app-specific changes, it should be relatively straightforward to script every step.
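
One possible shape for such a script, sketched in Python: run each failover step in order, stop at the first failure, and compare the total time against your RTO. The step names and the run_failover helper are hypothetical; each step would wrap your own database promotion, DNS update, and app-specific changes.

```python
import time


def run_failover(steps, rto_seconds):
    """Run (name, callable) steps in order and time them against the RTO.

    Any exception raised by a step propagates immediately, so later steps
    do not run and a human can take over from a known point.
    """
    started = time.monotonic()
    log = []
    for name, action in steps:
        step_started = time.monotonic()
        action()
        log.append((name, time.monotonic() - step_started))
    elapsed = time.monotonic() - started
    return {
        "steps": log,
        "elapsed": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }


# Example wiring with placeholder steps; replace the lambdas with real calls.
example_steps = [
    ("promote database replica", lambda: None),
    ("update DNS", lambda: None),
    ("app-specific changes", lambda: None),
]
```

Running this regularly against a test environment, for example run_failover(example_steps, rto_seconds=3600), gives you a measured recovery time rather than a guess.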

And finally, remember that the safest plan for solid region-level redundancy is active-active. Keeping two regions actively serving requests is the most architecturally complex option and carries the most management overhead, but it is the best way to make certain that you will be able to stay up if one region goes down.

So remember the bottom line… full region failures for a service, while infrequent, DO happen. If you need the highest possible level of uptime, multi-region redundancy is absolutely possible. Whether or not you take the time to do it is your design choice!