
Following my pattern of building AMIs for applications, I create my Apache Zookeeper cluster with Packer for the AMI and Terraform for the infrastructure. The Zookeeper nodes automatically discover the other nodes that are determined to be in the cluster.

The base playbook installs a base role for the common pieces of my systems (e.g. Logstash, the Sensu client, the Prometheus node_exporter) and then proceeds to install Zookeeper. As a last step, I install Exhibitor. Exhibitor is a co-process for monitoring, backup/recovery, cleanup and visualization of Zookeeper.

The role itself is very simple. The Zookeeper cluster is managed by Exhibitor, so there are very few settings passed to Zookeeper at this point. One thing to note, though: this requires an installation of the Java JDK to work.
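
As a rough sketch, the playbook shape is something like this (the role names here are illustrative, not the exact ones from my repo):

```yaml
---
# Sketch of the playbook layout -- role names are placeholders
- hosts: all
  become: yes
  roles:
    - base        # Logstash, Sensu client, Prometheus node_exporter, Java JDK
    - zookeeper   # install Zookeeper itself
    - exhibitor   # co-process that manages the Zookeeper configuration
```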

This means that the node will check itself into a shared configuration file in S3, and all the other Zookeeper nodes will read the same configuration file and can form the required cluster. You can read more about Exhibitor shared configuration on its GitHub wiki.
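
For a flavour of what that looks like, Exhibitor's standalone jar can be pointed at an S3 bucket for its shared config. A sketch, with the jar name, bucket and key as placeholders:

```bash
# Illustrative only -- jar name, bucket and key are placeholders
java -jar exhibitor.jar \
  -c s3 \
  --s3config my-config-bucket:zookeeper/shared.properties \
  --s3backup true \
  --hostname $(hostname -i)
```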

The instances that run in my infrastructure get a lifespan of 14 days. This allows me to continually test that I can replace my environment at any point. People always ask me if I follow the same principle for data nodes. I posted previously about replacing nodes in an ElasticSearch cluster; this post will detail how I replace nodes in a Riak cluster.

NOTE: This post assumes that you have the Riak Control console enabled. You can find out how to enable it in the post I wrote on configuring Riak.
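
For reference, enabling it in riak.conf (Riak 2.x) looks roughly like this:

```
## riak.conf -- enable the Riak Control web UI
riak_control = on
riak_control.auth.mode = off
```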

When you open Riak Control, you will find the following screens:

Cluster Health

Ring Status

Cluster Management

Node Management

Removing a node from the Cluster

In order to remove a node from the cluster, go to the cluster management screen. Find the node you want to replace in the list and click on the Actions toggle. It will reveal actions as follows:

As the node is currently running, I tend to choose the Allow this node to leave normally option (if the node had died or was unresponsive, I would usually choose Force this node to leave). Clicking on the Stage button details a plan of what is going to happen:

If the proposed changes look good, Commit the plan. Watch the partitions drain from the node to be replaced:

When all the partitions have drained, we have a 2-node cluster where the partitions are split 50:50:

We can now destroy the node and let the autoscaling group launch another to replace it.

Adding a new node to the Cluster

Assuming a new node has already been launched and is ready to go into the cluster, go to the cluster management page in the portal and enter the new node's details. It should follow the format riak@<ipaddress>.

Staging the change shows the list of actions that are pending on the cluster:

Commit the changes and watch the partitions rebalance across the cluster:

The cluster will return to being 3 nodes, with an equal partition split, and will then show as green again.
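
Incidentally, the same leave/join flow can be driven from the command line with riak-admin if you prefer that to the UI (the node names here are examples):

```bash
# On the replacement node: join it to an existing cluster member
riak-admin cluster join riak@10.0.1.10

# Tell the old node to hand off its partitions and leave
riak-admin cluster leave riak@10.0.1.99

# Review the staged changes, then commit them
riak-admin cluster plan
riak-admin cluster commit
```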

In a previous post, I talked about how I have tended towards the philosophy of 'Immutable Infrastructure'. As part of that philosophy, when a box is created in my environment, it has a lifespan of 14 days. On the 14th day, I get a notification telling me that the box is due for renewal. When it comes to ElasticSearch nodes, there is a process I follow to renew a box.

I have an example 3-node ElasticSearch cluster up and running to test this on:

Let's assume that instance i- was due for renewal. Firstly, I would usually disable shard reallocation. This will stop unnecessary data transfer between nodes and minimise wasted I/O.
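
On the ElasticSearch versions I was running, disabling allocation looks something like this (the exact setting name varies between versions):

```bash
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'
```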

I can see that it tells me the cluster is yellow and that I have 2 nodes in it. I can proceed with the instance termination.

I have an AWS Autoscale Group configured for ElasticSearch to keep 3 instances running. Therefore, the node that I destroyed will fail the Autoscale Group Healthcheck and a new instance will be spawned to replace it.

Using the ElasticSearch Cluster Health API, I can determine when the new node is in place:
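
A sketch of the call, assuming the API is reachable on localhost:

```bash
# Blocks until the cluster reports 3 nodes (or the timeout expires)
curl -XGET 'http://localhost:9200/_cluster/health?wait_for_nodes=3&timeout=10m&pretty'
```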

The command will continue running until the cluster has 3 nodes in it. If you want to replace more nodes in the cluster, repeat the steps above. Once you are finished, it is important to re-enable shard reallocation:
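
```bash
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'
```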

In my last post, I described how I use Packer and Terraform to deploy an ElasticSearch cluster. In order to make the logs stored in ElasticSearch searchable, I use Kibana. I follow the previous pattern and deploy Kibana using Packer to build an AMI and then create the infrastructure using Terraform. The Packer template has already taken into account that I want to use nginx as a proxy.

This role downloads a private SSL key and a certificate from an S3 bucket that is access-controlled through IAM. This allows us to configure nginx to act as a proxy. The nginx proxy template is available to view.
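
The real template lives in the repo; a stripped-down sketch of the idea, with all paths and names as placeholders:

```nginx
server {
  listen 443 ssl;
  server_name kibana.example.com;

  ssl_certificate     /etc/nginx/ssl/kibana.crt;   # fetched from S3 at build time
  ssl_certificate_key /etc/nginx/ssl/kibana.key;

  location / {
    proxy_pass http://localhost:5601;              # Kibana listening locally
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
  }
}
```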

We can then pass a number of variables to our role for use within Ansible:
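
Something along these lines; the variable names here are hypothetical, not the exact ones from the role:

```yaml
# Hypothetical variable names -- the real role may differ
kibana_version: "4.1.1"
kibana_elasticsearch_url: "http://elasticsearch.internal:9200"
nginx_ssl_bucket: "my-secrets-bucket"
nginx_ssl_cert: "kibana.crt"
nginx_ssl_key: "kibana.key"
```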

This allows me to scale my system up or down just by changing the values in my Terraform configuration. When the instances are instantiated, the Kibana instances are added to the ELB and are then available to serve traffic.
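
The scaling knob is nothing more exotic than the autoscaling group sizes in Terraform; roughly (resource and variable names are illustrative):

```hcl
# Sketch -- names and variables are placeholders
resource "aws_autoscaling_group" "kibana" {
  name                 = "kibana"
  launch_configuration = "${aws_launch_configuration.kibana.name}"
  load_balancers       = ["${aws_elb.kibana.name}"]
  vpc_zone_identifier  = ["${split(",", var.private_subnets)}"]
  min_size             = "${var.kibana_min_nodes}"
  max_size             = "${var.kibana_max_nodes}"
  desired_capacity     = "${var.kibana_desired_nodes}"
}
```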

As discussed in a previous post, I like to build separate AMIs for each of my systems. This allows me to scale up and recycle nodes easily. I have been doing this with ElasticSearch for a while now. I usually build an AMI with Packer and Ansible, and I use Terraform to roll out the infrastructure.

These are just some basic Ansible tasks to get the apt repo, packages and plugins installed on the system. You can find the templates used here. The important part to note is that variables are used both in the script and in the templates to set the cluster up to the required level.
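
The shape of those tasks is roughly as follows; the repo URL, version and plugin name are examples and depend on your ElasticSearch version:

```yaml
# Sketch of the install tasks -- repo URL and versions are examples
- name: Add the ElasticSearch apt repository
  apt_repository:
    repo: "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main"
    state: present

- name: Install ElasticSearch
  apt:
    name: "elasticsearch={{ elasticsearch_version }}"
    state: present

- name: Install the AWS cloud discovery plugin
  command: bin/plugin install cloud-aws
  args:
    chdir: /usr/share/elasticsearch
    creates: /usr/share/elasticsearch/plugins/cloud-aws
```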

Building an ElasticSearch Cluster with Terraform

The infrastructure of the ElasticSearch cluster is now pretty easy. I deploy my nodes into a VPC and onto private subnets so that they are not externally accessible. I have an ELB in place across the nodes so that I can easily get to the ElasticSearch plugins like Marvel and Head.

This allows me to scale my system up or down just by changing the values in my Terraform configuration. When the instances are instantiated, the ElasticSearch cloud plugin discovers the other members of the cluster and allows the node to join.
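
With the cloud-aws plugin installed, the discovery section of elasticsearch.yml is along these lines (the region, security group and cluster name are placeholders):

```yaml
# elasticsearch.yml -- EC2 discovery via the cloud-aws plugin
cluster.name: "my-es-cluster"
discovery.type: ec2
cloud.aws.region: "eu-west-1"
discovery.ec2.groups: "elasticsearch-cluster"   # only peers in this security group
```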

I use Autoscaling Groups in AWS for all of my systems. The main benefit for me here was to make sure that when a node died in AWS, the Autoscaling Group policy made sure that the node was replaced. I wanted to get some visibility of when the Autoscaling Group was launching and terminating nodes and decided that posting notifications to Slack would be a good way of getting this. With Terraform and AWS Lambda, I was able to make this happen.

This post assumes that you are already set up and running with Terraform.

We assume here that you have already created a Slack integration. The hook URL from that integration is required for the Lambda contents.

The filename slackNotify.zip is a zip of a file called slackNotify.js. The contents of that js file are available.
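
As a rough sketch of what such a handler looks like, it just unpacks the SNS message and POSTs it to the Slack hook URL (the URL and message formatting here are placeholders, not the real file):

```javascript
// Sketch of a slackNotify.js-style handler -- hook URL is a placeholder
var https = require('https');
var url = require('url');

var HOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ';

exports.handler = function (event, context) {
  // SNS delivers the autoscaling notification as a JSON string
  var message = JSON.parse(event.Records[0].Sns.Message);
  var text = message.Event + ': ' + message.EC2InstanceId +
             ' in ' + message.AutoScalingGroupName;

  var body = JSON.stringify({ text: text });
  var options = url.parse(HOOK_URL);
  options.method = 'POST';
  options.headers = { 'Content-Type': 'application/json' };

  var req = https.request(options, function (res) {
    res.on('data', function () {});
    res.on('end', function () { context.succeed(); });
  });
  req.on('error', function (e) { context.fail(e); });
  req.write(body);
  req.end();
};
```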

Terraform currently does not support hooking AWS Lambda up to SNS event sources. Therefore, unfortunately, there is a manual step required to configure the Lambda to talk to the SNS topic. There is a PR in Terraform to allow this to be automated as well.

In the AWS Console, go to Lambda and then choose the Lambda function.

Go to the Event Sources tab:

Click on Add Event Source, choose SNS from the dropdown and then make sure you choose the correct SNS Topic name:
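
If you would rather script that manual step than click through the console, the equivalent AWS CLI calls look roughly like this (the ARNs and function name are placeholders):

```bash
# Allow SNS to invoke the function
aws lambda add-permission \
  --function-name slackNotify \
  --statement-id sns-invoke \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:eu-west-1:123456789012:asg-events

# Subscribe the function to the topic
aws sns subscribe \
  --topic-arn arn:aws:sns:eu-west-1:123456789012:asg-events \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:eu-west-1:123456789012:function:slackNotify
```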

We then use another Terraform resource to attach the Autoscale Groups to the Lambda as follows:
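
A sketch of that resource, with the group and topic names as placeholders:

```hcl
# Sketch -- resource names are from my config, yours will differ
resource "aws_autoscaling_notification" "slack" {
  group_names = [
    "${aws_autoscaling_group.elasticsearch.name}",
  ]
  notifications = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_TERMINATE",
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
    "autoscaling:EC2_INSTANCE_TERMINATE_ERROR",
  ]
  topic_arn = "${aws_sns_topic.asg_events.arn}"
}
```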

I've long been a fan of configuration management tools. I have blogged, spoken at conferences and used Puppet as well as Chef and Ansible. The more I use these tools now, the more I realise I'm actually not making my life any easier.

Currently, the infrastructure I manage is 100% AWS Cloud based. This has actually changed how I work:

I have learned to always expect problems so I therefore should have everything 100% automated.

No server is kept in production for more than 2 weeks.

By combining these 2 ways of working, I can easily recover from outages. The speed of recovery comes down to being able to provision the pieces of my system as fast as possible. The simplest way to provision instances fast is to build my own AMIs with Packer. I have come to the realisation that when I boot an instance, I don't really want to wait for a configuration management tool to run. I have also begun to realise that having a tool change my systems in production can introduce unneeded risk. The Packer templates that build the AMI have serverspec tests built into them. This means that, at build time, I know if an AMI has been built correctly.
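
The serverspec tests are ordinary RSpec; a flavour of what runs at build time (the package and port here are examples for an ElasticSearch image):

```ruby
require 'serverspec'
set :backend, :exec

# Runs inside the Packer build, so :exec inspects the image being baked
describe package('elasticsearch') do
  it { should be_installed }
end

describe service('elasticsearch') do
  it { should be_enabled }
end

describe port(9200) do
  it { should be_listening }
end
```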

The AWS infrastructure itself is managed by Terraform. I tend to use AutoScaling Groups and Launch Configurations for the instances, and when Terraform is checking the state of the infrastructure, it will look up the latest AMI ID and make sure that it is part of the Launch Configuration. If it isn't, Terraform will update the Launch Configuration so that the next machine will be booted from the new AMI.

I use Rundeck for orchestrating changes to the infrastructure. I have a job in Rundeck that allows me to recycle all instances in an AutoScalingGroup one at a time and in a HA manner. From building a new AMI to fully recycling an AutoScalingGroup takes about 20 minutes (the Packer build itself takes about 12 minutes). So, in theory, it takes me about 20 minutes to release new security patches to all instances in an AutoScalingGroup.

Isn't this just 'Golden Images'?

Technically, yes. But the important thing for me is being able to roll out a fully tested AMI and then not making any additional changes to it in production. I would like to say that my infrastructure is 100% immutable, but after reading a recent article by @emmajanehw, I now realise that can never be the case. Each of my AMIs is versioned, and I have a nightly Rundeck job that tells me what version of an AMI a system is built / released with.

Do I Consider Configuration Management Dead?

Not at all. I simply do not want to make additional changes to my environments when I know they are working. Right now, I use Ansible to provision my AMIs as part of my Packer scripts, so I do believe these tools still need to be part of our ecosystem. I could substitute in any configuration management tool to help build my AMIs. The purists could even use bash / shell scripts to do the same job.

Can I only do this if I use *nix / AWS?

Not at all. At $JOB[-1], we were actually changing our provisioning to allow us to spin up images much faster. We were using a mix of AMIs and VMware templates for Windows and Ubuntu. Moving in that direction would reduce the time taken to provision a box from maybe an hour to minutes.

In my opinion, moving to a more immutable style of infrastructure is the next phase of infrastructure management for me. I believe the lessons learned from using config management tools in production across 1000s of nodes have helped me move in this direction, but YMMV.

Firstly, I'd like to say that this is not about naming and shaming. Secondly, I am not at all annoyed with the conference about its response. The conference I spoke to advertises itself as “engineering talks only”, so I wanted to post a few things about that.

In my opinion, the writing of code and the ecosystem of a specific platform is only 10% (or rather, a small portion) of what we need to be aware of as software engineers. I am a software developer who now works in the infrastructure / ops world. When I was writing application code only, I was not involved in understanding the entire ecosystem of the software I was working on. In hindsight, I really feel I missed out by not being part of it. Since joining the infrastructure world, I feel it has actually helped me develop better & more robust software.

Organising conferences is a huge amount of work and is, frankly, hard. I understand that conferences cannot cater for every part of an ecosystem. One thing I do think conferences should do is strive to make developers better. DevOps, Continuous Delivery and Infrastructure are / should be things that we, as developers, care about. To dismiss this style of topic from a conference that advertises “engineering talks only” can hinder developers from delivering the best products they can. It may also stop developers from understanding the importance of software being in production and making money.