December 5, 2015

Introduction

If you’ve ever had the misfortune of using a computer before, you’ve probably encountered what it’s like when a computer does something wrong. When you have a lot of computers working together doing a lot of computer-things (with several to a lot of those things going wrong), it helps to have logs of what’s been going on to help with troubleshooting, and it really helps to have those logs in some kind of centralized location. And if you’ve ever run into this kind of situation at work, then you’ve probably had to figure out how to set up and manage some kind of logging solution.

When I was in just this situation in a previous lifetime, back when I was a one-person ops team, I was struggling with the build versus buy versus open-source decision. I really wanted to try to set up an ELK (Elasticsearch, Logstash, and Kibana) cluster, but given that I was a one-person team, I didn’t have the time or the resources to figure it all out on my own from scratch. The information that I found online seemed mostly aimed at developers, not ops folks, none of the community Chef cookbooks worked off the shelf, and with everything else on my plate I didn’t have the time to make any of it work for me.

But now I’m at Etsy working on an incredible ops team with enough resources to dedicate to ELK wrangling to get it right, and we’ve gotten it operationalized and automated to the point that it’s very nearly as easy as “press button get ELK cluster.” My goal for this post is to write the one I wish I could have read so many years ago, with the hope that it will be helpful to anyone else who has to set up and manage an ELK cluster. This won’t cover what ELK is, how it compares to other logging solutions, or how to get it set up manually - those topics have all been covered elsewhere. What I will talk about is how we have our clusters set up in Chef, how we monitor and manage the clusters on a day-to-day basis, and how we handle things when the herd of ELK goes all sideways.

Press Button, Get ELK Cluster: Choosing and Provisioning Hardware

At Etsy, we use ELK pretty extensively for the vast majority of our log storage and searching needs. Because snowflake servers make us sad, all of our ELK roles are in Chef. In our ELK-utils repo you can see examples of the roles, recipes, templates, and data bags that we use to create a cluster. With this combination of roles and recipes, building a new cluster is as simple as provisioning nodes with the Elasticsearch, Logstash, and Kibana roles - when they come up, all the pieces are already working together and ready for log ingestion. For our large production cluster, each of the E, L, and K components is on its own dedicated hardware, but if your hardware is limited, all three roles can be combined on the same servers.

A few things that we’ve learned over the years with regards to provisioning ELK clusters:

Hardware matters! Specifically, disk IO is very important to ELK performance. We’re using SSDs for all of our ELK clusters. If you have your own data centers and are also using SSDs, make sure you choose SSDs with enough write performance to handle the load you expect on your cluster (this is often quoted as DWPD, or drive writes per day) - SSDs that are designed to handle read-intensive loads will die a lot earlier if you are writing hundreds of GB per day to them. As an example, we ingest between 7 and 10 TB of logs daily, and keep them available for 30 days - to support this without falling over, our production cluster has 70 Elasticsearch, 10 Logstash, and 2 Kibana servers, with the Elasticsearch servers using SSDs that are rated for write-intensive workloads.

Network bandwidth matters too! With a main production cluster that ingests several TB of logs per day, we found that we absolutely had to have 10GbE network cards in these servers rather than 1GbE cards. This was for making sure that node failures didn’t cause data loss - with our log volume, recovering from a node failure over a 1GbE NIC could take anywhere from several hours to tens of hours, slowing down the cluster’s performance as well as increasing the window of time during which a second node failure might cause data loss. With 10GbE NICs, recovery and rebalancing operations take much less time - again, this is with our 7-10TB of logs per day, so if your expected daily and overall log volume is much lower, 1GbE might be fine for you. This applies in the cloud too - if you’re using something like AWS, make sure you pick instances with enough network bandwidth available to handle your expected load.

By default, the Elasticsearch template that ships with Logstash will pull out and index every single possible field in every log line - we found that this had a noticeable negative impact on cluster performance with our log volume, and since we generally had only a few fields that we really needed to index on, decided to Chef out a new template that turned off this default behavior. Your mileage may vary, but this is something else to look out for if you’re expecting to have a relatively high log volume (anything under a terabyte daily and you’ll probably be fine with most of the defaults).
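One way to get that behavior, sketched below with hypothetical field names: an index template that sets dynamic to false in the _default_ mapping, so Elasticsearch only indexes the fields you explicitly declare rather than every field in every log line.

```json
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "string" },
        "status":     { "type": "integer" }
      }
    }
  }
}
```

Fields not listed under properties still get stored in the source document; they just aren’t individually searchable, which is the trade that bought us back the indexing performance.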

We also use Chef to install the Lumberjack log forwarder (our internal fork of logstash-forwarder) to client nodes that need to get logs into Logstash. The recipe to install and set up Lumberjack is also found in the ELK-utils repo, and log files are configured for ingestion by declaring them as attributes in roles.
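The exact attribute shape our internal Lumberjack fork consumes isn’t public, so the role snippet below is a hypothetical sketch of the idea: each log file is declared with its path and a type, which Logstash later uses to choose grok patterns and a target index.

```json
{
  "name": "elk_client",
  "json_class": "Chef::Role",
  "chef_type": "role",
  "default_attributes": {
    "lumberjack": {
      "files": [
        { "paths": ["/var/log/httpd/access.log"], "fields": { "type": "apache_access" } },
        { "paths": ["/var/log/messages"], "fields": { "type": "syslog" } }
      ]
    }
  }
}
```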

Adding a new log and index to one of our existing ELK clusters involves making changes in three places: the Chef role, where the log and its type are specified; the logstash.json configuration file, which is cheffed out in our Logstash cookbook and defines which grok patterns and indices are used for that log type; and the grok pattern itself (if a new one is being defined), which is also part of the Logstash cookbook. If you’re ever writing grok patterns, the Grok Debugger is incredibly helpful - don’t go elk herding without it!
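As a toy example (the log format and field names are invented for illustration), a grok filter keyed off a log type might look like this, using the stock IPORHOST, WORD, URIPATHPARAM, and NUMBER patterns that ship with Logstash:

```
filter {
  if [type] == "apache_access" {
    grok {
      match => [ "message", "%{IPORHOST:clientip} %{WORD:verb} %{URIPATHPARAM:request} %{NUMBER:status:int}" ]
    }
  }
}
```

The :int suffix coerces the captured status field to a number, which saves you from discovering later that all your response codes were indexed as strings.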

With everything in Chef, getting up and running with a new cluster is something that can happen in a matter of hours, from building the boxes to searching for your newly ingested logs in Kibana. But that’s not all that has to be considered - ELK clusters generally need a fair amount of love to keep them up and running happily.

Day to Day Operations: Chatops to the Rescue

Because we’re big into IRC and chatops at Etsy, we’ve taught our friendly local irccat a few things about ELK so that we can use it to help troubleshoot and manage the cluster in chat. Most of the stability and maintenance issues we’ve run into so far have involved the Elasticsearch part of the stack, so most of our tooling has been built to help us with that aspect. One of the nice things about Elasticsearch is its API, but one of the less-than-nice things about that API is having to remember all the different endpoints you might need to curl to get the information you need. Luckily, computers are here to remember things for us!

We approached this by taking the most common API endpoints we found ourselves using and writing irccat commands to do them for us. You might find that you end up using different ones more often, but some of the commands that we’ve created and found very helpful include:

Seeing the current health of the cluster

Finding which of the eligible servers is the current master

Getting a list of the current indices in general…

... or getting more detailed information about indices with a given name, including color-coded output for easier readability

Listing any current recoveries ongoing in the cluster

Seeing the status of all the shards in the cluster
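Each of those commands boils down to a call against one of Elasticsearch’s APIs. Below is a sketch of a wrapper in that spirit - the elk name and host are assumptions, the endpoints are the real cluster and cat APIs, and the script prints the curl commands rather than running them, so no live cluster is required:

```shell
# Hypothetical irccat-style helper; ES host is an assumption.
# Each case prints the curl command behind the matching ?elk command.
ES="${ES:-http://localhost:9200}"

elk() {
  case "$1" in
    health)   echo "curl -s $ES/_cluster/health?pretty" ;;  # cluster health
    master)   echo "curl -s $ES/_cat/master?v" ;;           # current master
    indices)  echo "curl -s $ES/_cat/indices?v" ;;          # list indices
    recovery) echo "curl -s $ES/_cat/recovery?v" ;;         # ongoing recoveries
    shards)   echo "curl -s $ES/_cat/shards?v" ;;           # shard status
    *) echo "usage: elk health|master|indices|recovery|shards" >&2; return 1 ;;
  esac
}

elk health
elk master
```

Piping the printed command into sh (or just running the curl directly) gives you the same output the bot posts into channel.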

We also have the ability to modify a few settings from IRC, such as the maximum number of concurrent rebalancing operations that can be in progress at any one time - we’ll normally have this set relatively low, to 1 or 2, but when the cluster is recovering from some event or we’re adding new nodes to it, we’ll turn this up to 7 or 8 to allow the cluster to rebalance itself as quickly as possible.
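Under the hood that’s a single transient cluster-settings update. In the sketch below, cluster.routing.allocation.cluster_concurrent_rebalance is the real Elasticsearch setting; the host and helper name are assumptions, and the command is printed rather than executed:

```shell
# Hypothetical helper that builds the settings call for a given
# rebalance concurrency; ES host is an assumption.
ES="${ES:-http://localhost:9200}"

set_rebalance() {
  echo "curl -s -XPUT $ES/_cluster/settings -d '{\"transient\":{\"cluster.routing.allocation.cluster_concurrent_rebalance\":$1}}'"
}

set_rebalance 2   # steady state
set_rebalance 8   # while recovering or adding nodes
```

Using a transient (rather than persistent) setting means a full cluster restart quietly resets it to the default, which is usually what you want for a temporary speed-up.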

When Things Go Sideways: Monitoring, Troubleshooting, and Maintenance

With this combination of commands, as well as dashboards that pull data from both Ganglia and Graphite to give us near-realtime looks into how each of the various sub-clusters of our ELK cluster is doing, we’ve found ourselves able to quite handily troubleshoot most failure cases that have come our way so far. ?elk master quickly lets us know which server is the current Elasticsearch master, giving us a good starting point for troubleshooting - most often the Elasticsearch logs from the master process. Those logs will point us to individual servers in the cluster that are causing problems - the bulk of the issues we’ve seen so far have been the cluster taking too long to mark unresponsive nodes as failed and remove them from the cluster, which was fixed with our upgrade to Elasticsearch 1.5.2. With this upgrade we also saw the cluster healing faster in general, so if you’re starting a new cluster, we’d highly recommend going with the latest and greatest stable version from the get-go.

Our dashboards are created using Etsy’s own dashboard framework and are used to view both system-level metrics, collected via Ganglia, and application-level metrics that are gathered and put into Graphite.

While we did experiment with using Elasticsearch’s Marvel plugin for monitoring, we ended up not being too keen on needing Elasticsearch up and running in order to monitor Elasticsearch. Having our monitoring completely decoupled from what it monitors has been working much better for us, and we’re quite happily using Elasticsearch’s API to generate all the metrics we need - but Marvel is certainly worth a look, and your mileage may vary.

The one major issue we’ve had relates to SSDs - a firmware upgrade was required on every single SSD in our production Elasticsearch cluster - and we wanted to get that done without any downtime! The manufacturer we were working with could only handle 66 disks at a time, so with 6 disks per ES node, we were cycling batches of 11 servers in and out of the cluster - just under 7 batches in all. This has been the biggest operational headache we’ve had in terms of managing ELK in production, but the process of working through that issue led to the creation of the ELK-utils repository and many of our ELK management irccat commands - so it wasn’t a complete disaster!

How big of a headache was this? Elasticsearch lets you specify which nodes in a cluster are included and excluded for a given index with a simple API call, so to remove a server from the cluster, all we had to do was add that server to the exclude list for… every single index. With close to 50 indices in the cluster at a time (we index our production logs daily) and upwards of 70 servers, that works out to roughly 50 API calls for every server moved in or out - far too many to make by hand. So we turned to our faithful friend automation, and thus the elkvacuate script was born. With it we were able to generate and execute all those API calls automatically, reducing 500+ curl commands to 11 runs of the script per batch of servers added or removed (we purposely removed servers one at a time to avoid overwhelming the cluster with rebalancing operations, which happened when we tried doing 11 at once).
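A minimal sketch of the idea behind elkvacuate follows - the function and index names are hypothetical, index.routing.allocation.exclude._name is the real Elasticsearch setting, and the commands are printed rather than executed so you can inspect them first:

```shell
# Hypothetical drain helper: one allocation-exclude call per index,
# telling Elasticsearch to move that index's shards off the node.
ES="${ES:-http://localhost:9200}"

drain_node() {
  local node="$1"; shift
  for idx in "$@"; do
    echo "curl -s -XPUT $ES/$idx/_settings -d '{\"index.routing.allocation.exclude._name\":\"$node\"}'"
  done
}

# In production the index list would come from the _cat/indices API
# rather than being hard-coded here.
drain_node es-node-42 logs-2015.12.01 logs-2015.12.02 logs-2015.12.03
```

Once every index excludes the node, the cluster rebalances its shards elsewhere and the node can be pulled safely; adding it back is the same loop with an empty exclude list.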

Conclusion

There’s a lot I covered in here, and even more that I didn’t cover, because it would take a blog post the size of an elk to go into detail on everything related to managing and operating an ELK cluster. If I could leave you with a few pieces of wisdom, they would be: pay attention to your (expected) capacity before you start setting things up, monitoring and configuration management are your friends, and JSON-based APIs are a lot friendlier when you can automate away most of the nitty-gritty details.

Thank You

Special thanks to Ben Cotton for editing, Ryan Frantz for additional editing and general awesomeness, and everyone who read this! If you’d like to chat more about shaving ELK-yaks, you can find me on the internets.