Graduating Past Playbooks

We use Ansible to configure and manage our servers at Nylas, but we’ve taken a different approach to structuring our automation. Instead of using multiple playbooks, we use a single playbook with a “role-per-host” design that runs hourly via ansible-pull. This post is an overview of how this system works and how it enforces consistency and modularity, so you can scale faster and with fewer bugs. Today we’re also releasing a tool called ansible-test, which makes it easy to test Ansible playbooks and roles using Docker.

Intro To Playbooks

One of the core concepts in Ansible is the playbook, a set of automation that can be run on a machine. Typically a playbook corresponds to a type of server in your infrastructure. For example, a proxy playbook would set up your load balancers, whereas a database playbook would configure a new machine in your MySQL cluster. Under the hood, playbooks are composed of individual roles, and each role installs and configures a single service. So in the previous example, the proxy playbook might have a role for HAProxy, Nginx, or another proxy server.

If you come from a Chef background, the Ansible playbook best resembles a Chef cookbook. In Puppet, the closest equivalent is a manifest.

Ansible playbooks are a great replacement for old-school bash scripts when configuring new servers, but they have drawbacks as your infrastructure grows.

The DNA of Automation

As described above, Ansible playbooks are designed to run a subset of automation on a group of hosts. This means that as your app evolves, the number of playbooks required to manage your infrastructure will also increase. Unfortunately, it’s common to end up with dozens of distinct playbooks that run only on some of your servers, and keeping track of “what automation needs to run where” becomes a complicated and dangerous task.

At Nylas, we’ve reduced this complexity by forgoing playbooks. Instead, we assign a single Ansible role to each host. When a server runs its automation, it only runs the role associated with that host.

In this way, our automation is like DNA: each cell (i.e. server) has a full copy of all the information that defines the organism (i.e. infrastructure), but only runs the specific part needed for its job. With Ansible, the process of running automation is called converging.

Using Ansible Roles

Because Ansible still operates on playbooks, we use a stub playbook that can run any specific role. It looks like this:
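A minimal sketch of such a stub playbook, assuming the role name arrives as an extra variable called `role` (the name used in the ansible-pull command in this post). The play name and the `hosts` pattern are illustrative assumptions, not a copy of the production file:

```yaml
# base_playbook.yml: a stub play that applies whichever role is passed in.
---
- name: Converge the role assigned to this host
  hosts: all               # assumption: the inventory decides which hosts match
  roles:
    - role: "{{ role }}"   # supplied at run time, e.g. -e "role=mailsync"
```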

Our servers are configured to run this automation every hour locally using ansible-pull. When the automation runs, it passes in a specific role that defines the machine type. Below, we are converging the mailsync role.

$ ansible-pull -i inventory.yml -e "role=mailsync" base_playbook.yml
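One way to schedule the hourly run is a cron entry managed by Ansible itself. The task below is a sketch only: it assumes cron is the scheduler, and the entry name and minute are hypothetical. The job string reuses the command shown above.

```yaml
# Hypothetical task: install an hourly cron entry that converges this host.
- name: Schedule hourly ansible-pull run
  cron:
    name: "ansible-pull converge"   # label for the crontab entry (assumed)
    minute: "0"                     # top of every hour; other fields default to "*"
    job: "ansible-pull -i inventory.yml -e 'role=mailsync' base_playbook.yml"
```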

At Nylas, we assign variables to this role based on the machine’s EC2 tag, which is resolved dynamically via an inventory script. We use the community ec2 inventory script, but any inventory script compatible with Ansible can be used to inject inventory variables.

Structuring Multiple Roles

Close readers may have noticed a constraint with this one-role-per-host design: how is it possible to implement complex automation? To illustrate, let’s look at an example HTTP web app server using the Python framework Flask. (Code available here.)

For our web server, let’s start with a role called app; this is the top-level role that will be assigned to our host. Because we’re using Flask, we also need to install Python. We’ll also be using Supervisor to monitor the app process, so we’ll need to install that too. The roles are structured as follows, with dependency wrapper roles for each component of the system.
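As a sketch of that structure, the app role can declare the other two roles as dependencies in its meta/main.yml. The role names follow the description above; the directory layout in the comments is the standard Ansible role layout, assumed here rather than taken from the actual repository.

```yaml
# Assumed layout:
#   roles/app/meta/main.yml          <- this file
#   roles/app/tasks/main.yml         (installs and configures the Flask app)
#   roles/python/tasks/main.yml
#   roles/supervisor/tasks/main.yml
---
dependencies:
  - role: python       # dependencies run before the app role's own tasks
  - role: supervisor
```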

This is a simple example, but you can easily imagine expanding this to a more complex scenario where there are many dependencies and sub-dependencies.

Variables in Roles

Let’s assume one day our app is ready to make the jump from Python 2.7 to Python 3.2 and we need to update our servers. To maintain modularity within our automation, this logic should not exist in the app role. Instead, all Python-specific automation should be kept in a single python role. Therefore, we need some way for the app role to tell the python role which version of the Python runtime to install.

We can achieve this using vars. If the python role has a var “python.version”, then the app role can override that specific var to tell the python role which version of Python to install.

For this to work correctly, we need to set hash_behaviour=merge, which preserves namespaces and prevents Ansible from overwriting a whole hash when only one of its keys is overridden. We also use the convention of defining vars like “python.version” as defaults, which makes explicit which vars an upstream role can override. The top-level role can always override these vars for dependency roles.
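To make this concrete, here is a sketch of the two files involved, shown in one listing. The file paths, the `install_pip` key, and the version strings are illustrative assumptions; only the `python.version` name comes from the text above.

```yaml
# roles/python/defaults/main.yml
# Defaults document which vars a role higher in the chain may override.
---
python:
  version: "2.7"       # quoted so YAML keeps it a string, not a float
  install_pip: true    # hypothetical second key, to show what merging preserves

# roles/app/vars/main.yml
# The top-level role overrides only the key it cares about.
---
python:
  version: "3.2"
```

With merging enabled, the python role sees version 3.2 and still gets its default `install_pip` value. With the default replace behaviour, the app role's `python` hash would replace the defaults hash entirely, and `install_pip` would be undefined.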

Wrapper Roles for Ansible

When using Chef, there’s a methodology called wrapper cookbooks that lets developers extend existing automation in a simple way. When one cookbook wraps another, it inherits the automation of that cookbook. The wrapping cookbook can then define attributes to override the default attributes provided by the underlying cookbook.

In some ways, our usage of Ansible is essentially an implementation of Chef’s wrapper cookbooks (with cookbooks = roles and attributes = vars). This strategy has helped us maintain modularity in our automation and abstract away the complexity of many dependencies. No two roles attempt to automate the same thing. When a wrapper role gets too complicated or requires new automation, we can easily create a new dependency or wrapper role.

Wrapper roles also make it incredibly easy to share automation. Instead of copying a project’s Ansible role into your fork and redefining its variables directly, you can just create a new role that declares the original role as a dependency. Then, simply override the vars you need for your fork. This strategy makes it easy to stay at the latest version without merge conflicts. It also adds flexibility for customization and gives contributors a path for submitting pull requests.
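For example, a fork might wrap an upstream role like this. The role names `our_nginx` and `upstream_nginx` and the `nginx.worker_processes` var are hypothetical, chosen only to show the shape of a wrapper role.

```yaml
# roles/our_nginx/meta/main.yml
# The wrapper adds no tasks of its own; it pulls in the unmodified upstream role.
---
dependencies:
  - role: upstream_nginx

# roles/our_nginx/vars/main.yml
# Override only the vars that differ in the fork.
---
nginx:
  worker_processes: 4
```

Because the upstream role's files are never edited, updating it is a matter of replacing its directory with the newer version.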

Introducing Ansible-Test: A Tool for Testing Role-Based Automation

Just like any piece of software, your Ansible automation should be tested. The Ansible documentation describes its philosophy:

“Ansible is actually designed to be a “fail-fast” and ordered system, therefore it makes it easy to embed testing directly in Ansible playbooks.”
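As an illustration of that approach, a play can end with a check that fails the run when the service did not come up. The port below is an assumption (the default for Flask's development server), not a value from our configuration.

```yaml
# Hypothetical final task: fail the converge if nothing is listening locally.
- name: Verify the app is accepting connections
  wait_for:
    port: 5000      # assumed port
    timeout: 30     # seconds to wait before the task fails
```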

However, we still feel the need for testing before pushing code to production. (Even if the Ansible roles are correct, you may have mistyped the HAProxy config, which would take down the site!) Ideally, Ansible would offer a way to test automation locally before deploying it.

To solve this problem, we’ve started building a new tool called ansible-test, which allows us to run role-based automation locally using Docker and validate assumptions about automation changes. (Those familiar with Chef can think of this as “test-kitchen for Ansible.”)

We already use ansible-test for our infrastructure automation, and today we’re releasing the initial source on GitHub. If you try it and it helps your workflow, we’d love to hear your feedback. In the future, we’d like to expand ansible-test into a larger framework, with functionality to automatically detect whether assumptions about the roles are correct.

Wrapping up

In this post, we described how the infrastructure team at Nylas is able to configure our EC2 fleet using a modular and flexible system based on Ansible roles. Great infrastructure is a critical part of our business, and we hope sharing these ideas and code can help others in the community as well.

In future posts, we’ll be writing about things like how we deploy our app code as Debian packages, and various ways we combine Graphite, Sensu, StatsD, and Diamond to monitor the health and performance of our production servers. If you’d like to be notified of future posts, sign up below!