My Life as a Sys Admin

Monthly Archives: June 2015

We have been using Ansible for all our Deployments, Code Updates, Inventory Management. We have also built a higher level API for Ansible called bootstrapper. Bootstrapper provides a rest API for executing various complex tasks via Ansible. Also, there is a simple python CLI for bootstrapper called bootstrappercli which interacts with the bootstrapper servers in our prod/staging env and performs the tasks. We have been using slack for our team interactions and it became a part of my chrome tabs always. We also have a custom slack callback plugin, that provides us a real time callback for each step of playbook execution.

Since bootstrapper is providing a sweet API, we decided to make Slack to directly talk to Ansible,. After googling for some time, i came across a simple slackbot called limbo, written in Python(offcourse) and the best part is its plugin system. It’s super easy to write custom plugins. As an initial step, i wrote a couple of plugins for performing codepush, adhoc and EC2 Management. Below are some sample plugin codes,

Configure Limbo as per the Readme present in the Repo and start the limbo binary. Now let’s try executing some Ansible operations from Slack.

Let’s try a simple ping module

Let’s try updating a staging cluster,

WooHooo, it worked 😉 This is opening up a new dawn for us in managing our infrastructure from Slack. But offcourse ACL is an important factor, as we should restrict access to specific operations. Currently, this ACL logic is written with in the plugin itself. It checks the user who is executing the command and from which channel he is executing, if both matches only, the Bot will start the talking to Bootstrapper, or else it throws an error to the user. I’m pretty much excited to play with this, looking forward to automate Ansible tasks as much as possible from Slack 🙂

Now a days its very difficult to contain all the services in a single box. Also, companies have started adopting micro service based architectures. For Docker user’s, they can achieve a highly scalable system like Amazon ASG using Apache mesos and their supporting frameworks like Marathon etc… But there are a lot services, where in a new node needs to know the existing members in a cluster to join successfully. Common method will be having a static ip’s to the machines. But we have started moving all our clusters slowly to become immutable via ASG. So either we need to go for Elastic IP’s, and use these static ip’s, but the next challenge would, what if we want to scale the cluster up. Also, if these instances are inside a private network, assigning static local ip’s is also difficult. So if a new machine is launched or if asg relaunches a faulty machine, these machine’s should know the existing cluster nodes. But again not all applications provide service-discovery features. I was testing around with a new Erlang based MQTT broker called verne. For a new Verne node to join the cluster, we need to pass the ip’s of all existing nodes to the JOIN command.

So a quick solution came to mind was using a Distributed K-V pair system like etcd. Idea was to use etcd to have a key-value that can tell us the nodes present in a cluster. But quickly i found another hurdle. Imagine a node belonging to a cluster, went down, the key will be still present in etcd. If we have something like ASG, will launch a new machine, so this new machine’s IP will not be present in the etcd key. We can delete manaully and launch a new machine, then add its IP to the etcd. But our ultimate goal is to automate all these, otherwise our clusters cannot become a full immutable and auto healing.

So we need a system, that can keep track of our Node’s activity atleast, when they join/leave the cluster.

SERF

Enter SERF. Serf is a decentralized solution for cluster membership, failure detection, and orchestration. Serf relies on an efficient and lightweight gossip protocol to communicate with nodes. The Serf agents periodically exchange messages with each other in much the same way that a zombie apocalypse would occur: it starts with one zombie but soon infects everyone. Serf is able to quickly detect failed members and notify the rest of the cluster. This failure detection is built into the heart of the gossip protocol used by Serf.

Serf, also need to know IP’s of other Serf Agents in the cluster. So initially when we start the cluster, we know the IP’s of the other Serf agents, and the serf agents will then populate our Cluster Key’s, on ETCD. Now if a new machine is launched/relaunched, etcd will use the Discovery URL and will connect to the ETCD Cluster. Once it has joined the existing cluter, our Serf agent will start and it will talk to the etcd agent running locally and will create the necessary key for the node. Once the Key is being created, Serf Agent will join the serf cluster using one of the existing nodes IP that was present in the ETCD cluster.

ETCD Discovery URL

A bit about etcd Discovery URL. It’s an inbuild discovery method in etcd for identifying the dynamic members in an etcd cluster. First, we need an ETCD server, this server won’t be a part of our cluster, instead this is used only for providing the discovery URL. We can use this server for multiple ETCD clusters, provided each cluster should have a unique URL. So all the new nodes can talk to their unique url and can join with the rest of the etcd nodes.

$ curl -X PUT 172.16.16.99:4001/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/_config/size -d value=3
where, 6c007a14875d53d9bf0ef5a6fc0257c817f0fb83 => a unique UUID
value => Size of the cluster, if the number of etcd nodes that tries to join with the specific url is > value, then the extra etcd processes will fall back to being proxies by default.

SERF Events

Serf comes with inbuilt event handler system. It has some inbuilt event handlers for Member Events like member-join, member-leave. We are going to concentrate on these two events. Below is a simple handler script, that will add/remove keys in ETCD, whenever a node joins/leave the cluster.

Design

Now we have some vague idea of serf and etcd. Let’s start building our self-discovering cluster. Make sure that the etcd Discovery node is up and a unique URL has been generated for the same. Once the URL is ready, lets start our ETCD cluster.

Serf has successfully created the key. Now, let’s terminate one of the node’s. one of our Serf agent will immidiately receives an EVENT for member-leave and it will start the handler. Below is the log for the First agent who received the event.

Once the node joins the cluster, the other existing serf nodes will receive the member-join event, and will also tries to execute the handler. But since the key is already being created, it wont recereate it.

So now we have IP’s of our active nodes in the cluster in etcd. Now we need to extract the IP’S from etcd and need to use them with our Applications. Below is a simple python script that returns the IP’s in as a python LIST.

We can use ETCD and confd to generate the config’s dynamically for our applications.

Conclusion

ETCD and SERF provide a powerfull combination for achieving a self-discovery feature onto our clusters. But it still needs heavy QA testing before it’s being used in production. Especially, we need to test on the confd automation part more, as the we are not using a single KEY-VALUE, as the SERF creates a separate K-V for each node. And the IP discovery from ETCD is done by iterating the key-directory. But these tools are backed by an awesome community, so i’m sure soon these tools will helps us to achieve a whole new level for clustering our applications.