Blue/Green Infrastructure with Terraform

Reducing the fear of introducing breaking changes in the cloud

Infrastructure as Code is one of the cool things right now. Every DevOps-related conference in the past two years had a talk or two about the subject, and that’s a good thing.

In the wake of the DevOps movement, HashiCorp emerged as one of the most respected companies in that space. Today I’m going to talk about one of their products: Terraform.

What is Terraform?

Terraform is a tool that lets you manage cloud resources in a declarative way. Using a simple configuration language (HCL), it lets you define the shape of a cloud infrastructure, including VPCs, Subnets, Compute Instances, Load Balancers, DNS Records and so on. It works with every major cloud provider, but it’s not cloud-agnostic. That means you can create, for example, a Load Balancer in AWS or Google Cloud, but the code will be different for each of them.

What is Blue/Green deployment?

Blue/Green deployment is a DevOps practice that aims to reduce downtime on updates by creating a new copy of the desired component while keeping the current one running. You end up with two versions of the system: one with the current version (blue) and another with the newer one (green). When the new version is up and running, you can seamlessly switch traffic to it. This is useful not only to reduce downtime, but also to improve rollback time when something bad happens.


Blue/Green Infrastructure

While Blue/Green deployment is a technique more commonly used with application deployment, the reduced costs of the cloud, in conjunction with the tools we have right now, make it possible to have two copies of an entire cloud infrastructure with little to no pain.

It is important to note that doing Blue/Green deployment of an entire Cloud Infrastructure is not a silver bullet, and it is certainly a bit too much if you are making small changes (for example, adding a new EC2 Instance to your stack). But for major or breaking changes it is a win, and I personally recommend it.

Terraform to the rescue!

I’ll be using Amazon Web Services for this tutorial, but the code won’t vary too much with another provider.

After finishing this, you will be able to create an infrastructure containing:

A Virtual Private Cloud

Three Subnets, each one in a different Availability Zone

A Security Group

Three EC2 Instances serving NGINX on port 80 (each one in a different subnet)

To follow this tutorial, you need to have your AWS credentials configured in your environment, with at least the AmazonEC2FullAccess policy attached, plus access to the S3 bucket that will hold the Terraform state.

Creating a VPC (Virtual Private Cloud)

I know this is a Terraform tutorial, but a recommended practice is to have a manually created VPC. You can create VPCs with Terraform, but there are a lot of external services that rely on knowing your VPC ID beforehand, so it is better to not create a new one every time on every Blue/Green deployment.

Also, you may have security groups that are created externally by another team in your organization. For these reasons, we will create the VPC using the AWS Console. You can also create one from the command line with the AWS CLI (the aws ec2 create-vpc command).

Installing Terraform

You can download Terraform from the HashiCorp website or install it with a package manager (brew, apt).

Creating the project

Create a new folder in your workspace and name it terraform_blue_green. Then, initialize a GIT repository, add a simple .gitignore that ignores the .terraform folder and open the folder with your favorite text editor. In my case, I’ll be using Visual Studio Code.

Initializing the Terraform State

Terraform stores the state of the infrastructure in a JSON File. It is recommended (and required for this tutorial) to store that file on an external backend like Amazon S3. As I’m using AWS for this Tutorial, I’ll stick to S3, but Terraform supports the equivalent in each provider.

First of all, you need to create the S3 bucket in which the state will reside. You can do this either by going to the S3 Console or by doing:

> aws s3api create-bucket --bucket terraform-bluegreen

Then, create a file named bootstrap.tf inside the project folder, with this content.
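A minimal sketch of what bootstrap.tf could look like, written in Terraform 0.12+ syntax. The bucket name comes from the command above; the region, the state key (v1) and the default value of the variable are assumptions, so adjust them to your setup. The backend region must match the region in which the bucket was actually created.

```hcl
# bootstrap.tf (sketch; region, key and default value are assumptions)

# The version of the infrastructure, interpolated into names and CIDR blocks
variable "infrastructure_version" {
  default = "1"
}

# The cloud provider we will be using
provider "aws" {
  region = "us-west-2"
}

# The backend in which the state will be saved.
# Variables cannot be interpolated here, so the key is written literally.
terraform {
  backend "s3" {
    bucket = "terraform-bluegreen"
    key    = "v1"
    region = "us-west-2"
  }
}
```

After creating the file, running terraform init configures the backend and downloads the provider.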

In this file we have defined

The version of the Infrastructure

The Cloud provider we will be using (in this case, AWS)

The Backend in which the state will be saved (in this case, S3), and the configuration attached to it.

Using existing resources in Terraform

As we’ll need the ID of the previously created VPC to do anything in our infrastructure, we will be storing it in a variable. To do this, create a file named vpc.tf with this content:
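A minimal sketch of vpc.tf, assuming the variable is named vpc_id. The ID shown is a placeholder, not a real VPC; replace it with the ID of the VPC you created.

```hcl
# vpc.tf (sketch; the variable name and the ID are placeholders)
variable "vpc_id" {
  description = "ID of the manually created VPC shared by every infrastructure version"
  default     = "vpc-xxxxxxxx"
}
```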

Creating our first resource: Subnets

To do anything useful, we first need subnets. We will create three of them, each in a different Availability Zone. Create a file named subnets.tf, with this content:
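A minimal sketch of such a file, in Terraform 0.12+ syntax. It assumes variables named vpc_id and infrastructure_version are declared elsewhere in the project, a VPC CIDR of 10.0.0.0/16, and the us-west-2 Availability Zones; the availability_zones variable and the CIDR scheme are illustrative choices of mine.

```hcl
# subnets.tf (sketch; zones and CIDR scheme are assumptions)
variable "availability_zones" {
  default = ["us-west-2a", "us-west-2b", "us-west-2c"]
}

resource "aws_subnet" "terraform-blue-green" {
  count             = 3
  availability_zone = element(var.availability_zones, count.index)
  vpc_id            = var.vpc_id

  # Version 1 yields 10.0.10.0/24, 10.0.11.0/24 and 10.0.12.0/24;
  # version 2 yields 10.0.20.0/24 and so on, so the versions never overlap.
  cidr_block = "10.0.${var.infrastructure_version}${count.index}.0/24"

  map_public_ip_on_launch = true

  tags = {
    Name = "terraform-blue-green-v${var.infrastructure_version}"
  }
}
```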

In this file we created three subnets specifying:

Count: The number of Subnets we want to create

Availability zone: In this case we are using the element() function, which takes a list and an index and returns the element at that index. If the index is greater than or equal to the number of elements, it wraps around to the start of the list. This is useful to assign a different availability zone to each subnet.

VPC ID

CIDR Block: This is probably the most confusing part. We interpolated the previously defined infrastructure_version variable into the CIDR block, so that each version of the infrastructure gets its own non-overlapping address ranges. This will help in the future, when creating the second version. You may need to change the CIDR block so that it fits inside the range you defined for your VPC.

Assign a Public IP by default to any Network Interface assigned to this subnet

Name: We’ve appended the Infrastructure version into it

With the file in place, first do:

> terraform plan

+ aws_subnet.terraform-blue-green[0]...

+ aws_subnet.terraform-blue-green[1]...

+ aws_subnet.terraform-blue-green[2]...

Plan: 3 to add, 0 to change, 0 to destroy.

The plan command does a dry run and tells you what changes will be done. It is important to plan before doing anything, as you can spot errors. In this case, the plan tells us that it will add three subnets, and that’s what we wanted.

So now we can run this:

> terraform apply

Do you want to perform these actions? Terraform will perform the actions described above. Only 'yes' will be accepted to approve.

Now, you can go to the AWS Console (under the VPC/Subnets section) and your subnets should appear

Creating a Security Group

To be able to access our resources in the future, we need to create a Security Group in our VPC. For the sake of simplicity, we will be creating a Security Group that enables all inbound traffic from everywhere.

Create a file named security_groups.tf in your project, with this content:
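A minimal sketch of such a file, in Terraform 0.12+ syntax. The resource names (terraform-blue-green, inbound, outbound) and the description are assumptions, and vpc_id and infrastructure_version are assumed to be declared elsewhere in the project.

```hcl
# security_groups.tf (sketch; resource names are assumptions)
resource "aws_security_group" "terraform-blue-green" {
  name        = "terraform-blue-green-v${var.infrastructure_version}"
  description = "Terraform Blue/Green"
  vpc_id      = var.vpc_id
}

# Allow all inbound traffic from everywhere (fine for a tutorial only)
resource "aws_security_group_rule" "inbound" {
  type              = "ingress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1" # all protocols; ports must be 0 with this value
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.terraform-blue-green.id
}

# Allow all outbound traffic
resource "aws_security_group_rule" "outbound" {
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.terraform-blue-green.id
}
```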

In this file, we’ve created a security group for our VPC (using its VPC ID) and two Rules: One for Inbound traffic and one for Outbound traffic. The most important parts are:

From/To Port: The port range the rule applies to. In this case, we target all ports.

Protocol: You can use tcp, udp, icmp or “-1”, which means all protocols. HTTP is not a valid value here, as it runs on top of TCP.

CIDR Blocks: A list of CIDR blocks that are allowed by the rule. In our case, we allowed all IPv4 traffic.

With the file in place, run a terraform plan and terraform apply. After that, we should be able to see our security group in the AWS Console (under EC2/Security Groups).

Creating an SSH Key

To be able to access an AWS Instance later, we need to assign an SSH Key to it.

First, create a Key Pair by using ssh-keygen:

> mkdir keypairs

> ssh-keygen -f keypairs/keypair -P ""

Generating public/private rsa key pair.

Your identification has been saved in keypairs/keypair.

Your public key has been saved in keypairs/keypair.pub.

Given this is a tutorial, we won’t bother moving the Private Key to a secure place (in a real project, you definitely should).

Then, create a file named keypairs.tf in the root folder of the project. Give it this content:
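A minimal sketch of such a file, in Terraform 0.12+ syntax. The resource name and the key name are assumptions, infrastructure_version is assumed to be declared elsewhere in the project, and the path points at the public key generated above.

```hcl
# keypairs.tf (sketch; resource and key names are assumptions)
resource "aws_key_pair" "terraform-blue-green" {
  key_name   = "terraform-blue-green-v${var.infrastructure_version}"
  public_key = file("${path.module}/keypairs/keypair.pub")
}
```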

Then do:

> terraform plan

+ aws_key_pair.terraform-blue-green...

Plan: 1 to add, 0 to change, 0 to destroy.

> terraform apply

Terraform will perform the following actions:

+ aws_key_pair.terraform-blue-green...

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions? Terraform will perform the actions described above. Only 'yes' will be accepted to approve.

Enter a value: yes

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Now, the Key appears on the AWS Console (under EC2/Key Pairs)

Creating (at last!) EC2 Instances

Create a file named instances.tf, and paste the following:
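A minimal sketch of such a file, in Terraform 0.12+ syntax. The AMI ID is a placeholder rather than a real image, and the sketch assumes the subnets, security group and key pair are all declared under the resource name terraform-blue-green, with infrastructure_version declared elsewhere in the project.

```hcl
# instances.tf (sketch; the AMI ID is a placeholder and names are assumptions)
resource "aws_instance" "terraform-blue-green" {
  count         = 3
  ami           = "ami-xxxxxxxx" # replace with the ECS-optimized AMI for your region
  instance_type = "t2.micro"

  # One instance per subnet; element() wraps around if count exceeds the list
  subnet_id              = element(aws_subnet.terraform-blue-green[*].id, count.index)
  vpc_security_group_ids = [aws_security_group.terraform-blue-green.id]
  key_name               = aws_key_pair.terraform-blue-green.key_name

  # Initialization script: run NGINX in Docker and expose it on port 80
  user_data = <<-EOF
    #!/bin/bash
    docker run -d -p 80:80 nginx
  EOF

  tags = {
    Name = "terraform-blue-green-v${var.infrastructure_version}"
  }
}
```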

Let’s explain a little bit about this file:

We have created a resource of type aws_instance, with these parameters:

Count: The number of resources of this type. In this case, we will create 3 instances

AMI: The Amazon Machine Image for the instances. In this case, we chose the official AWS ECS image, as we will run a Docker container inside it. Keep in mind that AMI IDs are region-specific and this one only works in the us-west-2 region, so look up the equivalent ECS-optimized AMI if you are in another region.

Instance Type: The instance type of the instances

Subnet Id: As explained above, the element() function takes a list and an index and returns the element at that index, wrapping around if the index is greater than or equal to the number of elements. This allows us to assign a different subnet ID to each instance.

Key Name: The name of the key pair. We chose the previously created one

User Data: This allows us to assign an initialization script to the instance. In our case, we are running an NGINX Docker container and exposing it on port 80. There are better ways to define User Data scripts, but we’ll keep it simple for now.
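The remaining steps rely on a load balancer sitting in front of the three instances. A minimal sketch of one, in Terraform 0.12+ syntax, assuming a Classic Load Balancer declared in a hypothetical load_balancers.tf and assuming the subnets, security group and instances are all declared under the resource name terraform-blue-green. The health check values are illustrative.

```hcl
# load_balancers.tf (sketch; file name, resource names and health check are assumptions)
resource "aws_elb" "terraform-blue-green" {
  name            = "terraform-blue-green-v${var.infrastructure_version}"
  subnets         = aws_subnet.terraform-blue-green[*].id
  security_groups = [aws_security_group.terraform-blue-green.id]
  instances       = aws_instance.terraform-blue-green[*].id

  # Forward HTTP on port 80 to NGINX on the instances
  listener {
    instance_port     = 80
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    target              = "HTTP:80/"
    interval            = 30
  }
}
```

As with the previous files, run terraform plan and terraform apply to create it.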

Accessing the public DNS of the Load Balancer from a browser should display the NGINX page.


You should also be able to see it in the AWS Console (under EC2/Load Balancers)

(Optional) Assign A DNS Record to the Load Balancer V1

I’m not going to cover too much of this case, but what I’ve ended up doing in production is creating a DNS Record that points to a specific version of the Load Balancer. An example of this in terraform could be:
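A minimal sketch of such a record, in Terraform 0.12+ syntax. The hosted zone ID and the domain are placeholders, and the sketch assumes the load balancer is a Classic Load Balancer declared as aws_elb.terraform-blue-green. Creating it also requires Route 53 permissions.

```hcl
# dns.tf (sketch; zone ID, domain and resource names are placeholders)
resource "aws_route53_record" "terraform-blue-green" {
  zone_id = "ZXXXXXXXXXXXXX"
  name    = "v${var.infrastructure_version}.example.com"
  type    = "A"

  # Alias record pointing at this version's load balancer
  alias {
    name                   = aws_elb.terraform-blue-green.dns_name
    zone_id                = aws_elb.terraform-blue-green.zone_id
    evaluate_target_health = true
  }
}
```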

Commit your changes

Commit your changes so far

> git add .

> git commit -m "Version 1"

Manually Pointing a DNS record to the Load Balancer

DevOps is not all about automation. In some cases, it’s good practice to keep a small amount of human interaction. In our case, we will manually assign a DNS record to the desired version of the infrastructure (via the load balancer).

To be able to perform this step you’ll need to have a registered domain and the corresponding Route 53 Hosted Zone.

Enter the desired Hosted Zone and create an A Record with an Alias of the previously created load balancer (terraform-blue-green-v1…).

This is the entry point of your system and what your clients will be accessing.

Creating the Infrastructure V2

First, create a new branch in your repository (and I seriously recommend removing the .terraform folder):

> git checkout -b v2

> rm -rf .terraform

Now, modify bootstrap.tf with this:
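A minimal sketch of the modified file, in Terraform 0.12+ syntax. As before, the region and the state key naming (v2) are assumptions; only the two version-related values differ from Version 1.

```hcl
# bootstrap.tf for Version 2 (sketch; region and key are assumptions)
variable "infrastructure_version" {
  default = "2" # was "1"
}

provider "aws" {
  region = "us-west-2"
}

terraform {
  backend "s3" {
    bucket = "terraform-bluegreen"
    key    = "v2" # was "v1"; a separate state file per version
    region = "us-west-2"
  }
}
```

Because each version keeps its state under a different key, Terraform treats Version 2 as a brand new infrastructure instead of a change to Version 1.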

As you can see, you need to modify both the infrastructure_version variable and the key of the state file in the S3 bucket. It would be nice if Terraform allowed interpolating the infrastructure_version variable in the key, but for now it’s not possible. There is an issue on GitHub though.

Now, as you deleted the .terraform folder, you need to reinitialize the state:

> terraform init

Now modify your instances.tf, changing the instance_type from t2.micro to t2.medium. (You can choose whatever you like.)

Doing a terraform plan will reveal that in fact, terraform will create all resources again.

> terraform plan

...

Plan: 11 to add, 0 to change, 0 to destroy.

After doing terraform apply, you should end up with an entirely new infrastructure, without changing the old one.

In the AWS Console, you should now see both versions side by side under Instances, Subnets, Security Groups and Load Balancers.

Routing traffic through the new Infrastructure

As we did previously with Version 1, point your DNS record to the new load balancer using an ALIAS.

Deleting the old infrastructure

When all traffic is going to the new Load Balancer, it’s time to delete Version 1 of the infrastructure.

To do this, first commit all the changes in Version 2, and then check out the old version again. Delete the .terraform folder, initialize the state again with terraform init, and then run terraform destroy to remove the Version 1 resources.

A note of caution

In this guide, we used a DNS Record to select which infrastructure version is the production one. While this works most of the time, there are some cases when some client-side libraries cache DNS Entries, so you should wait some time to get the traffic to drain from the old balancer. You can solve this by maintaining a manually-created load balancer and changing its instances.

Conclusion

Terraform provides a clean and declarative way of defining Infrastructure as Code. Thanks to that, we can use it to do things that seemed impossible a few years ago.

I want to end this article by saying that this approach has a couple of downsides, and I will write an article in the future explaining how to achieve the same results by using Terraform Modules (those in fact provide better flexibility overall).