I remember when all of this was logs

The great thing about running lots of specialised, separate instances in the cloud, as we do with Amazon’s Web Services, is that machines come and go frequently based on demand, health check rules, or changes to the application code hosted on them. The bad thing about running lots of instances in the cloud is that machines come and go frequently — and with them, so do the valuable logs they generate.

At MetaBroadcast, there are already a few tools in place to monitor instances and make sure everything is working as planned. We have Sensu, which is like a more cloud-friendly version of Nagios; it checks that instances are up and working as expected. If not, it’ll fire off an alert that something might need a bit of inspection.
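To give a flavour of how that works (the check name, command and thresholds here are invented for illustration), a Sensu check is just a small JSON definition telling the client what command to run, how often, and who cares about the result:

```json
{
  "checks": {
    "check_api_health": {
      "command": "check-http.rb -u http://localhost:8080/health",
      "subscribers": ["api"],
      "interval": 60,
      "handlers": ["default"]
    }
  }
}
```

If the command exits non-zero, Sensu raises an event and the configured handler sends the alert.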

We also employ Graphite, which generates realtime graphs of everything from low-level data (CPU, memory, disks) to the more abstract, and arguably more important, data such as API response times. We review the response times weekly to make sure we're delivering a fast service, as these may affect a service level agreement with a client.
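Every Graphite graph is really just a call to its render API, so pulling a week of a metric is a single URL; a sketch (the hostname and metric path are made up):

```
http://graphite.example.com/render?target=servers.api-01.loadavg.load&from=-7days&format=png
```

Swap `format=png` for `format=json` and the same call returns the raw datapoints instead of an image.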

Here’s a Graphite graph of one of our Mongo database replicasets over a single week.

More generally, services and applications within an instance will generate all sorts of logs constantly. It's here that we want to make the most improvement, as these logs are often vital in working out why an instance failed; but by the time we come to look, the instance has been terminated and the logs lost with it.

a new challenger approaches

Although Logstash itself isn't new, a number of things over the past year have greatly increased its usage. Towards the end of last summer, ElasticSearch (you guessed it: the company behind ElasticSearch) acquired Logstash. It's a very natural fit: Logstash reads server logs and interprets them as JSON, and ElasticSearch is very good at searching JSON quickly.
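As a rough sketch of what that reading-and-interpreting looks like (the file path and ElasticSearch host are placeholders), a minimal Logstash config tails a log file, parses each line with a grok pattern, and ships the resulting JSON on to ElasticSearch:

```
input {
  file {
    path => "/var/log/apache2/access.log"
    type => "apache-access"
  }
}

filter {
  grok {
    # COMBINEDAPACHELOG is a pattern Logstash ships with; it splits an
    # Apache access line into fields like response, bytes and useragent
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}

output {
  elasticsearch {
    host => "logs.example.com"
  }
}
```

Once the fields are split out, ElasticSearch can index and query each one individually, which is what makes the Kibana dashboards below possible.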

Around the same time, ElasticSearch also made huge improvements to Kibana. Kibana is a JavaScript front-end that queries ElasticSearch directly, and allows for endless customisation based on the information you want to display from all of that lovely data.

Here’s an example of what Apache access log response codes might look like in Kibana, if you have lots of errors.

So now ElasticSearch have an end-to-end logging system: Logstash reading and interpreting the logs, ElasticSearch storing and sorting them, and Kibana querying and prettifying the lot. ‘ELK’ for short. And it’s all open source and free. What’s the catch?

how much wood would a woodchuck chuck

The only real caveat here is resources. Logstash itself is quite a hefty application. Since it depends on a JVM and ElasticSearch running behind it, it wants to have lots of memory and it’s not so good at sharing. This isn’t going to work on our instances, as we run some services on smalls and micros, and wouldn’t want to have to put extra memory in place just for logging.

Luckily, this problem already has a solution: a single central server running Logstash and ElasticSearch, fed by lots of lightweight log shippers running on the instances whose logs you want to collect. There are a number of different tools that can do this harvesting. There’s the semi-official logstash-forwarder (formerly known as Lumberjack, which is a much better name), but the tool I’ve chosen to use is Beaver. It’s small and lightweight, like the others, but also seems to have the simplest and most Puppet-friendly deployment, a key factor given we’ll be rolling it out to lots of instances. Most importantly, though, it too has an excellent name, and I look forward to unleashing an army of robotic beavers onto instances, to chew through logs and send the nice JSON to one big log dam.
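Beaver’s configuration is a small ini file; a sketch (the transport, hostnames and paths are just examples) with one section of general settings and one section per log file to watch:

```ini
[beaver]
transport: redis
redis_url: redis://logs.example.com:6379/0
redis_namespace: logstash:beaver

[/var/log/apache2/access.log]
type: apache-access
tags: apache,www
```

On the central server, a matching redis input in the Logstash config picks these events up, which keeps the heavyweight parsing work off the small instances entirely.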

one dam to rule them all

Between Beaver running on the clients, and Logstash, ElasticSearch and Kibana running on a central server, we should have everything we need to get some really interesting realtime logs moving and displaying. The coolest bit is that Kibana allows you to create and save any number of different ‘dashboards’, featuring whatever you like. The idea is that with one central log system, a spike on API response time, or a spike in 500 server errors can be correlated to things such as database load, CPU and memory load or HTTP requests, with just a few clicks.

In the next blog post, I’ll go into more detail about how we’ve connected these bits together and made them work to scale, and show how things go from the logs Beaver reads through, to the tables, charts and graphs Kibana generates to make sense of it all.