TL;DR: This blog post is about the choices we made while organizing our infrastructure resources in AWS VPC and automating everything around them.

Our Infrastructure So Far

Since we signed up on AWS fairly recently, we were given a VPC by default. We used to launch EC2 instances in the public subnet of the default VPC without caring much about VPC-specific details. Instances were secured using security groups, but these were not well organized, and since there was only one large public subnet, anybody could launch public instances in it without thinking much about security.

Moreover, there were other things we wanted which were not possible with the earlier state of our infrastructure, like a way to replicate the infrastructure across production and staging environments. As our servers grew in number, we realized that these problems would only get bigger and harder to resolve if we didn't address them now.

Looking Ahead

Apart from the consumer-facing app, we have a bunch of other products which cater to different aspects of the business. There are several small dev teams working on these products, most of which iterate fast on their ideas and release their changes from the dev environment to staging and eventually to production. We felt that for the dev teams to keep iterating fast from their dev machines to production, they have to have control over their infrastructure, so that our ops team doesn't become the bottleneck in their release cycle. Giving dev teams entire control over the infrastructure can be a risky business though. No matter how ops-aware our developers are, they will always want to push things out fast, which will inevitably lead to mistakes of wide-ranging magnitude. As responsible operations people who also care about developer productivity, we had to come up with a plan so that our developers are empowered to move their pieces through the infrastructure freely, in well-defined silos, without affecting other projects.

Another important requirement for enabling developers to ship their code with speed, while maintaining overall infrastructure sanity, is the ability to replicate the infrastructure across environments. A developer working on an HTTP API might want to set up a dev environment where they can deploy code often to test it out. To keep things moving fast, the operations team has to make sure that the developer gets the resources they need as easily and quickly as possible. After a while, the developer might want to set their code up in a staging environment on infrastructure similar to the dev environment. We need to make sure all of that happens in a way that is fast as well as less prone to errors.

The AWS Good Parts – VPCs and Subnets

In an attempt to address these problems, we decided to use the VPC features that were available to us but not used at all. One of many such features is the ability to create subnets within an AWS VPC. This let us split our infrastructure into subnets: small, isolated units containing the resources that serve the needs of an individual product or team. To make sure that developers can achieve high availability for their services, we create identical subnets for each product across different availability zones. This kind of organization gave us better control over security, as we could use granular network ACLs and routing on top of security groups.
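As a rough sketch of what the network ACL part looks like in Terraform (the resource names, subnets and CIDR ranges here are illustrative, not our actual values), a per-product ACL attached to a product's subnet pair might be declared like this:

```hcl
# Illustrative network ACL for one product's multi-AZ subnet pair.
# All names and CIDR blocks below are hypothetical.
resource "aws_network_acl" "checkout" {
  vpc_id     = "${aws_vpc.production.id}"
  subnet_ids = ["${aws_subnet.checkout_a.id}", "${aws_subnet.checkout_b.id}"]

  # Allow inbound HTTP only from within the VPC.
  ingress {
    protocol   = "tcp"
    rule_no    = 100
    action     = "allow"
    cidr_block = "10.0.0.0/16"
    from_port  = 80
    to_port    = 80
  }

  # Allow outbound ephemeral ports for return traffic.
  egress {
    protocol   = "tcp"
    rule_no    = 100
    action     = "allow"
    cidr_block = "0.0.0.0/0"
    from_port  = 1024
    to_port    = 65535
  }
}
```

Because ACLs apply at the subnet boundary, rules like these hold for every instance in the product's subnets regardless of which security groups individual instances use.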

In addition to isolating resources, it also allowed us to control access to the infrastructure more efficiently. Since all the resources serving a product lie in a single subnet, we can control access to different operations on those resources, like launching or terminating EC2 instances. Only the developers responsible for a product can perform such operations on resources in the subnet that caters to that product. We achieved this by creating IAM policies with finely tuned privileges controlling access to the resources in each subnet, and attaching those policies to specific users or groups.
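As an illustration of the idea (the account ID, region and subnet ID below are placeholders, not our real values), a policy can confine ec2:RunInstances to a single product's subnet by naming that subnet's ARN explicitly among the resources the action is allowed to touch:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RunInstancesInProductSubnetOnly",
      "Effect": "Allow",
      "Action": "ec2:RunInstances",
      "Resource": [
        "arn:aws:ec2:us-east-1:123456789012:subnet/subnet-0abc1234",
        "arn:aws:ec2:us-east-1:123456789012:instance/*",
        "arn:aws:ec2:us-east-1:123456789012:network-interface/*",
        "arn:aws:ec2:us-east-1:123456789012:volume/*",
        "arn:aws:ec2:us-east-1:123456789012:security-group/*",
        "arn:aws:ec2:us-east-1::image/ami-*"
      ]
    }
  ]
}
```

A launch into any other subnet fails, because the subnet being used is not in the allowed resource list. Similar statements scoped per subnet can cover termination and other operations.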

While this is how we decided to organize our VPC in production, we wanted our infrastructure to look exactly the same in our test and staging environments. With this in mind, we went with one VPC per environment, so that we have a way to maintain parity across environments. Apart from one VPC in each environment, we also have a common VPC for services that need access to, or need to be accessed from, all the environments. One such service is our VPN server, which allows us to access the resources in the private subnets of any of the environment-specific VPCs. We achieved inter-VPC communication between the common VPC and the environment VPCs by setting up peering connections between them.
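In Terraform, a peering connection between the common VPC and one environment VPC boils down to two pieces: the connection itself and a route on each side pointing at it. A sketch (resource names are illustrative):

```hcl
# Illustrative peering between the common VPC and the staging VPC.
resource "aws_vpc_peering_connection" "common_to_staging" {
  vpc_id      = "${aws_vpc.common.id}"
  peer_vpc_id = "${aws_vpc.staging.id}"
  auto_accept = true
}

# Each side also needs a route sending the other VPC's CIDR
# through the peering connection; the staging side would have
# a mirror-image route back to the common VPC.
resource "aws_route" "common_to_staging" {
  route_table_id            = "${aws_route_table.common.id}"
  destination_cidr_block    = "${aws_vpc.staging.cidr_block}"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.common_to_staging.id}"
}
```

One such pair of resources per environment VPC gives the VPN server in the common VPC a path into every environment's private subnets.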

This is roughly what our VPCs look like:

In the above representation, the arrows indicate the possible network flow. The details of multi-AZ subnet pairs and the replication of subnets across environment-specific VPCs have been omitted for the sake of simplicity.

Automating Everything using Terraform

With all our ideas combined, we would end up with an infrastructure made of several small identical pieces: a subnet for each product team, and those subnets multiplying across different VPCs for different environments. No matter how much we love working with AWS via the management console or the CLI, we don't want to manage VPCs manually, especially when reproducibility is one of the major goals. We evaluated a few tools we could use to write common configuration to launch our infrastructure: Ansible, AWS CloudFormation and Terraform.

Since we were already orchestrating resources using Ansible, we started off with it. It didn't take long before we hit a few blockers using Ansible to manage the skeletal resources of our infrastructure: VPCs, subnets, routing tables, etc. We then compared CloudFormation and Terraform and settled on Terraform because of its simpler syntax and out-of-the-box support for various cloud providers. Within a couple of days, we laid down the Terraform code that allows us to create our barebones infrastructure in any environment with a single command. This was a huge win for us, as it helped us replicate our VPCs across our test, staging and production environments.
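The "single command" here is just the standard Terraform workflow; roughly (the directory layout shown is illustrative, not our exact repo structure):

```
$ cd terraform/staging   # one directory of .tf files per environment
$ terraform plan         # preview the VPC, subnets, route tables, etc. to be created
$ terraform apply        # actually create them
```

Running the same configuration against a different environment's directory (or variables) produces an identical set of resources there.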

It took us quite some fiddling to figure out various pieces of our Terraform setup. One such problem was figuring out how to work with Terraform state, which is very important, as Terraform needs a way to know what our infrastructure looked like the last time it was run. We decided to keep it simple and commit the .tfstate files for each environment into our Terraform repo on GitHub. Since we are not in a situation where multiple Terraform runs happen at once, we don't end up with conflicting state files; it works for us and hopefully should work fine for others as well.

Another problem was organizing the Terraform code so that it is easy to understand and not repetitive. To address that, we've created modules for repeating pieces, like subnets and route tables. To add a new subnet, we simply invoke these modules with arguments like the name and CIDR range of the subnet. The following code takes care of creating three private subnets in one of our environment-specific VPCs:
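(The snippet in the original post is not reproduced here; the following is a sketch of what such module invocations look like, with made-up subnet names and CIDR ranges.)

```hcl
# Hypothetical invocations of the private_subnet module;
# subnet names and CIDR blocks are illustrative.
module "api_subnet" {
  source         = "./modules/private_subnet"
  name           = "api"
  cidr_block     = "10.1.1.0/24"
  vpc_id         = "${aws_vpc.staging.id}"
  route_table_id = "${aws_route_table.private.id}"
}

module "workers_subnet" {
  source         = "./modules/private_subnet"
  name           = "workers"
  cidr_block     = "10.1.2.0/24"
  vpc_id         = "${aws_vpc.staging.id}"
  route_table_id = "${aws_route_table.private.id}"
}

module "db_subnet" {
  source         = "./modules/private_subnet"
  name           = "db"
  cidr_block     = "10.1.3.0/24"
  vpc_id         = "${aws_vpc.staging.id}"
  route_table_id = "${aws_route_table.private.id}"
}
```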

The implementation of our private_subnet module takes care of creating subnets with the specified names and CIDR ranges and associating them with the specified route tables.
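A simplified sketch of what such a module's implementation might look like (variable names are illustrative, not our exact module):

```hcl
# modules/private_subnet/main.tf -- simplified sketch.
variable "name" {}
variable "cidr_block" {}
variable "vpc_id" {}
variable "route_table_id" {}

# Create the subnet with the given name and CIDR range.
resource "aws_subnet" "private" {
  vpc_id     = "${var.vpc_id}"
  cidr_block = "${var.cidr_block}"

  tags {
    Name = "${var.name}"
  }
}

# Associate the subnet with the specified route table.
resource "aws_route_table_association" "private" {
  subnet_id      = "${aws_subnet.private.id}"
  route_table_id = "${var.route_table_id}"
}
```

Keeping the subnet and its route table association together in one module means adding a subnet is always a single, uniform block of configuration.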

Conclusion

Having a picture of what our infrastructure should look like in terms of our needs helps in organizing it well. The choices we made might not be the best for everyone, but they fit our needs quite well. We've focused mainly on isolating small pieces of our infrastructure, and on efficient access control.

So far, our VPC setup is helping us visualize how our instances and other resources are spread across our infrastructure, and decide where to place new resources. Our Terraform code is in good enough shape that we can use it to replicate our VPC across different AWS accounts or data centers. It will be quite interesting to see how this combination works for us in the long run.