My Life as a Sys Admin

Category Archives: Ansible

We have been using Ansible for all our deployments, code updates, and inventory management. We have also built a higher-level API for Ansible called bootstrapper. Bootstrapper provides a REST API for executing various complex tasks via Ansible. There is also a simple Python CLI for bootstrapper called bootstrappercli, which talks to the bootstrapper servers in our prod/staging environments and performs the tasks. We use Slack for our team interactions, and it has become a permanent fixture among my Chrome tabs. We also have a custom Slack callback plugin that gives us a real-time callback for each step of a playbook execution.

Since bootstrapper provides a sweet API, we decided to make Slack talk directly to Ansible. After googling for some time, I came across a simple Slack bot called limbo, written in Python (of course), and the best part is its plugin system. It’s super easy to write custom plugins. As an initial step, I wrote a couple of plugins for performing code pushes, ad hoc commands, and EC2 management. Below is some sample plugin code.
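A plugin along these lines might look like the following minimal sketch. It is not the actual plugin: the `!codepush` command syntax and the `submit_codepush` helper are hypothetical, and the bootstrapper call is replaced by a stub because its endpoint is not shown in this post. The only piece taken from limbo is the `on_message(msg, server)` hook, which, as far as limbo's README describes it, is called for each message and posts back whatever string it returns.

```python
import re

# Hypothetical command syntax: "!codepush <cluster> <branch>"
COMMAND = re.compile(r"^!codepush\s+(\S+)\s+(\S+)$")


def submit_codepush(cluster, branch):
    """Stub standing in for the HTTP call to bootstrapper (endpoint not shown here)."""
    return {"cluster": cluster, "branch": branch, "status": "queued"}


def on_message(msg, server):
    """Limbo plugin hook: return a string to reply, or None to stay silent."""
    match = COMMAND.match(msg.get("text", "").strip())
    if not match:
        return None  # not our command
    cluster, branch = match.groups()
    job = submit_codepush(cluster, branch)
    return "codepush of {branch} to {cluster}: {status}".format(**job)
```

Keeping the bootstrapper call in its own function means the command parsing can be exercised without a running Slack connection or API server.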

Configure limbo as per the README in the repo and start the limbo binary. Now let’s try executing some Ansible operations from Slack.

Let’s try a simple ping module.

Let’s try updating a staging cluster.

Woohoo, it worked 😉 This opens up a new dawn for us in managing our infrastructure from Slack. But of course ACLs are an important factor, as we should restrict access to specific operations. Currently, the ACL logic is written within the plugin itself. It checks which user is executing the command and which channel the command is coming from; only if both match does the bot start talking to bootstrapper, otherwise it returns an error to the user. I’m pretty excited to play with this, and I’m looking forward to automating as many Ansible tasks as possible from Slack 🙂
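The user-and-channel check described above could be sketched roughly as follows. The ACL table, the user and channel names, and the `handle` wrapper are all made up for illustration; the real plugin's data structure is not shown in this post.

```python
# Hypothetical ACL table: command -> users and channels allowed to run it.
ACL = {
    "codepush": {"users": {"alice", "bob"}, "channels": {"deploy"}},
    "adhoc": {"users": {"alice"}, "channels": {"infra"}},
}


def is_allowed(command, user, channel, acl=ACL):
    """Return True only if both the user and the channel are permitted."""
    rule = acl.get(command)
    if rule is None:
        return False  # unknown commands are denied by default
    return user in rule["users"] and channel in rule["channels"]


def handle(command, user, channel):
    """Either hand off to bootstrapper (not shown) or return an error message."""
    if not is_allowed(command, user, channel):
        return "Sorry {0}, you cannot run '{1}' from #{2}".format(user, command, channel)
    return "running {0}".format(command)
```

Denying unknown commands by default keeps a newly added plugin command locked down until someone adds an explicit rule for it.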

As we all know, this is the era of cloud servers. With the emergence of the cloud, we no longer need to worry about the difficulties of hosting servers on premises. But if you are a cloud engineer, you definitely know that anything can happen to your machine. Unlike on-premises hardware, which we can go and fix ourselves, in the cloud that is difficult. I have faced a lot of such weird issues myself, including my cloud service provider terminating my servers, my Postgres DB master among them. So having a self-healing cluster helps a lot, especially if a server goes down in the middle of the night while we sleep. Stateless services are the easiest candidates for self-healing compared to DBs, especially if we are using a master-slave DB architecture.

For those who are using Docker and Mesos, Marathon provides scaling features similar to Amazon ASG. We define the number of instances that have to be running, and Marathon makes sure that number always exists. Like Amazon ASG, it will launch a new container if any container accidentally terminates. I personally tested this feature of Marathon a long time ago, and it’s really promising. There are indeed other automated container management systems, but I find Marathon quite flexible.

But at Clementine, in our current architecture, we are not yet using Docker in production, and we rely heavily on AWS for all our clusters. With more features like secure messaging, VoIP, etc. being added to our product, we are expanding tremendously, and so is our infrastructure. Being a DevOps engineer, I need to keep the uptime. So this time I decided to prototype a self-healing cluster using Amazon ASG and Ansible.

Design

For the auto healing I’m going to use Amazon ASG and Ansible. Since Ansible is agentless, we need to either use Ansible in stand-alone mode and provision the machine via a cloud-init script, or use ansible-pull. Or, as the company recommends, use Ansible Tower, which is a paid solution. But we have built our own higher-level API over Ansible called bootstrapper. Bootstrapper exposes a REST API which we can invoke for all our Ansible management. Our in-house version of bootstrapper can perform various actions such as EC2 instance launch with/without EIP, ad hoc command execution, server bootstrapping, code updates, etc.

But again, if we use a plain AMI and try to bootstrap the server completely during startup, it adds a heavy delay, especially when PyPI times out while installing the pip packages. So we decided to use a custom AMI which has our latest build in it. Jenkins takes care of this part. Our build flow is like this:

The above POST request to Ansible performs EIP management, and Ansible assigns the proper EIP to the machine without any collision. We keep an EIP mapping for each cluster, which makes sure that we never assign a wrong EIP to a machine. If no EIP is available, we raise an exception and notify the infra team via email/Slack about the instance and cluster.
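The mapping lookup can be illustrated with a small sketch. Everything here is an assumption made for the example: the `EIP_MAP` layout, the `pick_eip` name, the `NoFreeEIP` exception, and the documentation-range addresses. The real bootstrapper logic is not shown in this post, and the notification step is left to the caller.

```python
class NoFreeEIP(Exception):
    """Raised when every address mapped to a cluster is already in use."""


# Hypothetical mapping: cluster name -> Elastic IPs reserved for it.
EIP_MAP = {
    "voip": ["203.0.113.10", "203.0.113.11"],
}


def pick_eip(cluster, in_use, eip_map=EIP_MAP):
    """Return the first EIP reserved for `cluster` that is not in `in_use`."""
    for eip in eip_map.get(cluster, []):
        if eip not in in_use:
            return eip
    # The caller would catch this and email/Slack the infra team.
    raise NoFreeEIP("no free EIP left for cluster %r" % cluster)
```

In a real service the lookup and the assignment would need to happen under a lock (or be otherwise serialized), so that two nodes booting at the same time cannot be handed the same address.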

Now create an Auto Scaling group with the required number of nodes. For the scaling policies, select “Keep this group at its initial size”. Once the ASG is up, it starts the nodes based on the AMI and subnet specified. Once a machine starts booting, cloud-init executes our userdata scripts, which in turn talk to our Bootstrapper-Ansible, which then assigns the EIP and runs the playbooks on the host. Below is a sample log from our bootstrapper for EIP management, invoked by an ASG node while it was booting up.

I’ve tested this prototype with one of our VoIP clusters, and the cluster is working perfectly with the corresponding EIPs as mapped. We terminated the machines multiple times to make sure that the EIP management works properly and the servers get bootstrapped. The results are promising, and this motivates us to migrate all of our stateless clusters to self-healing, so that a cluster heals automatically whenever a machine becomes unhealthy. No human intervention is needed unless Amazon really screws up their ASG :p

It’s been almost 2 months since I started playing with Ansible full time. Like most sysadmins, I’ve been using Ansible via the CLI most of the time. Unlike Salt/Puppet, Ansible is agentless, so we need to invoke things from the box that has Ansible and the respective playbooks installed. Also, if we want to use Ansible with EC2 features like auto-scaling, we need to either buy Ansible Tower or use ansible-pull along with the userdata script. I’ve also seen people use custom scripts that fetch their repo and execute the Ansible playbook locally to bootstrap.

Being a big fan of Flask, I’ve used it to create many backend APIs to automate a bunch of my tasks. So this time I decided to write a simple Flask API for executing Ansible playbooks, Ansible ad hoc commands, etc. Ansible also provides a Python API, which made my work easier. Like most Ansible users, I use roles for all my playbooks. We could directly expose an API to Ansible and execute playbooks within the request. But there are cases where the playbook execution takes more than 5 minutes, and of course any network latency will slow down package downloads and the like. I don’t want to force my HTTP clients to wait for the final output of the playbook execution to get a response back.

So I decided to go ahead with a job queue. Each time a request comes to my API, the job is queued in Redis and the job ID is returned to the client. Then my job workers pick the jobs from the Redis queue, execute them in the backend, and keep updating the job status. So I need to expose 2 APIs first: one for receiving jobs and one for job status. For Redis queues, there is an awesome library called rq. I’ve been using rq for all my queuing tasks.
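The submit-then-poll pattern can be shown without any external services. The sketch below is not rq: an in-memory dict and a `queue.Queue` from the standard library stand in for Redis and the rq queue, and a background thread stands in for the job worker. The function names are invented for the example; the status strings are borrowed from the ones rq uses (queued, started, finished, failed).

```python
import queue
import threading
import uuid

jobs = {}                # job id -> status; stands in for Redis
pending = queue.Queue()  # stands in for the rq queue


def submit(task, *args):
    """What the 'receive job' endpoint would do: queue the work, return an ID."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = "queued"
    pending.put((job_id, task, args))
    return job_id


def status(job_id):
    """What the 'job status' endpoint would return."""
    return jobs.get(job_id, "unknown")


def worker():
    """Pull jobs off the queue and record their status as they progress."""
    while True:
        job_id, task, args = pending.get()
        jobs[job_id] = "started"
        try:
            task(*args)
            jobs[job_id] = "finished"
        except Exception:
            jobs[job_id] = "failed"
        finally:
            pending.task_done()


# A daemon thread plays the role of a separate worker process.
threading.Thread(target=worker, daemon=True).start()
```

The client gets its job ID back immediately from `submit` and can call `status` as often as it likes, which is the whole point of not tying the HTTP response to the playbook run.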

Flask API

The job accepts a bunch of parameters like host, role, and env via HTTP POST. Since the role, host, etc. have to be retrieved from the HTTP request, my playbook YAML file has to be dynamic. So I decided to use Jinja templating to create my playbook YAML file dynamically. Below is my sample API for role-based playbook execution.
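To illustrate just the templating step (this is not the original API code), here is a minimal sketch. It swaps Jinja for `string.Template` from the standard library so that it runs without extra dependencies; the playbook layout, the `render_playbook` name, and the choice of required parameters are assumptions made for the example.

```python
from string import Template

# Hypothetical playbook template; $host, $role and $env come from the POST body.
PLAYBOOK = Template("""\
- hosts: $host
  vars:
    env: $env
  roles:
    - $role
""")

REQUIRED = ("host", "role", "env")


def render_playbook(params):
    """Validate the request parameters and return the playbook YAML as text."""
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError("missing parameters: " + ", ".join(missing))
    return PLAYBOOK.substitute({key: params[key] for key in REQUIRED})
```

Because the values end up inside a YAML file that Ansible will execute, a real API should also restrict them to known hosts and roles rather than accepting arbitrary strings.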

Once the playbook file is ready, we need to invoke Ansible’s API to perform our bootstrapping. This is actually done by the job workers. Below is a sample function which invokes the playbook API from Ansible core.

Now we have a fully fledged API server for executing role-based playbooks. This API can also be used with userdata scripts in auto-scaling, wherein we perform an HTTP POST request to the API server and the API server starts the bootstrapping. I’ve tested this app locally with various scenarios, and the results are promising. As a next step, I’m planning to extend the API to do more jobs, like automating code pushes, running ad hoc commands via the API, etc. With applications like Ansible, Redis, and Flask, I’m sure sysadmins can attain DevOps nirvana :). I’ll be pushing the latest working code to my GitHub account soon…

For the past 2 years, I have played with config management tools like Puppet and Salt. These tools mostly follow a client-server model, except Salt, which also supports a push model. Over the last 6 months, Ansible has been gaining popularity. Ansible is a push-model system which relies on SSH. So before adopting Ansible completely, I decided to give it a try. I need to make sure that Ansible supports all the basic features its competitors support, which would also really help in migration.

Installation

Ansible is pretty easy to install. We can install it from source, via package managers, or even via pip. On Ubuntu, we can use the official PPA for installing Ansible.

Since Ansible relies on SSH, things like host key verification errors will prevent SSH connections, resulting in failures. We can disable the host key verification check in the ansible.cfg file

# add this option under the [defaults] section of the config file
host_key_checking = False

or we can set an environment variable for the current session with export ANSIBLE_HOST_KEY_CHECKING=False. By default, Ansible uses the hosts file in its configuration directory (typically /etc/ansible/hosts), so we can define the static machines there. We can add either the IP or a DNS-resolvable FQDN. Once the IP/FQDN is added, we can test connectivity via the ping module. Make sure that the Ansible server’s SSH public key is added to authorized_keys on the remote machines.

Managing Custom Facts

Config management tools like Puppet/Salt support custom facts defined on the remote machines. We can define the custom facts, and the config management server can use them. Even though Ansible is agentless, we can still define custom facts on the remote systems. Whenever we query for facts, Ansible connects to the remote machines and fetches the facts using its default library. But it also looks for custom facts in /etc/ansible/facts.d/, so we need to put our custom facts file in this directory. The file must have a .fact extension. If it is a script, it must be executable and should print valid JSON. If we just want to define some facts directly, we can simply create a file like the one below

[myfact]
role=test
profile=staging

The above fact file adds two fact variables called role and profile, with the values given in the file. Custom facts show up under ansible_local in the gathered facts. Now let’s use the setup module and see if we are able to retrieve the new custom facts.
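For the script variant of a fact file mentioned earlier, a minimal sketch could look like this. The values are made up, and the file name is an assumption: saved as, say, /etc/ansible/facts.d/deploy.fact and made executable, its output would be expected to appear under ansible_local.deploy.

```python
#!/usr/bin/env python
import json


def gather():
    """Collect whatever should be exposed as facts; these values are made up."""
    return {"role": "test", "profile": "staging"}


if __name__ == "__main__":
    # Ansible runs the file and parses whatever it prints to stdout as JSON.
    print(json.dumps(gather()))
```

A script is only worth the trouble when the values have to be computed on the machine; for fixed values the static file shown above is simpler.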

Managing Dynamic Inventory

In a cloud environment, it’s difficult to maintain a static inventory. Ansible supports dynamic inventory for vendors including AWS EC2, and provides an inventory script for it. We can also run this script directly to query EC2 and get the list of all instances. To successfully make an API call to AWS, we need to configure Boto. The simplest way is to export two environment variables, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

We can also use wildcards with the generated groups, for example tag_Name_test*. For Rackspace users, there is an official module called rax that works perfectly with Ansible.
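To make the group naming concrete, here is a rough sketch of what a dynamic inventory script does. It is not the EC2 script that ships with Ansible: the instance list is hard-coded and made up, and the helper names are invented. What it does follow is the general contract for inventory scripts, which print JSON mapping group names to hosts when called with --list and a host's variables when called with --host.

```python
import json
import re
import sys

# Made-up instances; a real script would fetch these from the EC2 API.
INSTANCES = [
    {"address": "10.0.0.11", "tags": {"Name": "test-web"}},
    {"address": "10.0.0.12", "tags": {"Name": "test-db", "env": "staging"}},
]


def safe(text):
    """Replace characters that are awkward in group names with underscores."""
    return re.sub(r"[^A-Za-z0-9_]", "_", text)


def build_inventory(instances):
    """Group hosts under tag_<key>_<value> names."""
    inventory = {"_meta": {"hostvars": {}}}
    for inst in instances:
        for key, value in inst["tags"].items():
            group = "tag_%s_%s" % (safe(key), safe(value))
            inventory.setdefault(group, {"hosts": []})["hosts"].append(inst["address"])
        inventory["_meta"]["hostvars"][inst["address"]] = {}
    return inventory


if __name__ == "__main__":
    if "--list" in sys.argv[1:]:
        print(json.dumps(build_inventory(INSTANCES), indent=2))
    else:
        # "--host <name>" returns that host's variables; none are defined here.
        print(json.dumps({}))
```

With this data the script yields the groups tag_Name_test_web, tag_Name_test_db, and tag_env_staging, so a pattern like tag_Name_test* would select both hosts.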

Encrypting YAML Data Files

This is an important feature that most config management systems lack. In most current systems, we need to define sensitive data such as SSH keys, API auth IDs/tokens, etc. in plain text, which increases the security risk. Ansible Vault comes to the rescue here. The vault feature can encrypt any structured data file used by Ansible. This can include “group_vars/” or “host_vars/” inventory variables, variables loaded by “include_vars” or “vars_files”, or variable files passed on the ansible-playbook command line with “-e @file.yml” or “-e @file.json”. Role variables and defaults are also included! While invoking a playbook, we can pass --ask-vault-pass and enter the vault password when prompted, so that Ansible can decrypt the file and use its contents during execution.

Ansible is truly an awesome product. It has many new features, like Vault, compared to its competitors, and it’s backed by an awesome community, so we can expect more exciting features in the future.