Elastic Stack for Root Cause Analysis at Mapp

We are a small Root Cause Analysis (RCA) team at Mapp, and we use Elasticsearch, Logstash and Kibana (the Elastic Stack) for logfile analysis. Logfiles have always been our friends, but at some point the standard Unix command line tools we had been using no longer sufficed, and we realized that we needed a more powerful tool. We built a cluster and pumped lots of logs into it. Today, our biggest cluster holds 8.5 TB of data, giving us visibility across applications and locations. The Elastic Stack is changing the way we work, and it's spreading to ever more applications and teams.

The vitality of the Elastic technology stack

Different modules of our software produce their own logs, in different formats. Previously, we did all those dirty grep, sed and awk moves to extract certain patterns from the raw logs and then analysed the results. This was tedious, and it was impossible to see trends across different parameters. When we started using the Elastic Stack, the whole log analysis became faster, more proactive, more meaningful and more accurate. With Elasticsearch we can now do full-text search in near real time. Using Logstash and its powerful plugins, we can deal with different types of logs from different software modules and extract valuable fields from them. Kibana lets us create and use dashboards and extract analysis results from the logs. Passing information from team to team has also become easy, because we can share dashboards. Plugins such as Kopf, Curator and Marvel help us manage Elasticsearch and its indices.
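
To make the idea of extracting fields concrete, here is a minimal Python sketch of what a Logstash filter does conceptually: it turns a raw log line into named fields that can then be searched and aggregated. The log format, the field names and the function parse_line are made up for illustration; they are not our actual logs or configuration.

```python
import re
from typing import Optional

# Hypothetical log format: "<date> <time> <LEVEL> [<module>] <message>"
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<module>[^\]]+)\] "
    r"(?P<message>.*)"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of a log line, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# One raw line becomes a structured event with four fields.
event = parse_line("2016-03-01 12:00:05 ERROR [mailer] Connection refused by relay")
print(event)
```

Once every line is an event with fields like level and module, questions such as "how many errors per module per hour" become simple aggregations instead of one-off grep pipelines.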

The Elastic technology stack from head to toe

When we say that we use the Elastic Stack, this means that we cover the full logfile analysis story, both content and technology. Let's start with the application that produces the logs. We know it and the people behind it because of our Root Cause Analysis (RCA) role, so we can give recommendations on the content of the logfile, and also on its structure. Next we make the logs visible in the Elastic Stack, either by feeding them into an existing cluster or by creating a new stack. In the latter case, we are the ones who set up Elasticsearch, Logstash and Kibana. Then we go back to the users to ask how this works for them. Working with the Kibana dashboards, they start to think differently and come up with new use cases. We keep developing the Elastic Stack: we configure Logstash, do the sizing, and optimize and upgrade all components.
The Elastic Stack users are Third Level Support, DevOps, Developers and Product Management. And us. This is very important: there's no better way to understand your users than being one yourself.

A powerful combination: Elastic Stack and RCA

In this post, I'll argue that this is an amazingly powerful combination: RCA, the Elastic Stack and a full team.

For one, as RCA, we've always been bridging teams and departments. We talk to everyone who is affected by incidents, and to everyone who knows about their causes and helps to prevent them - in short, to everyone.

Second, we have a vital interest in the Elastic Stack because we use it a lot ourselves. At the same time, the Elastic Stack has a great built-in "Power to the User" philosophy that makes it possible to answer questions from many perspectives.

Third - let me take one step back. Jez Humble wrote in this article that "Bad behavior arises when you abstract people away from the consequences of their actions." In this spirit, we don't tell people "Don't worry, we'll take care of this", but aim to bring people back in touch with the consequences of their actions. The Elastic Stack is a great tool for this.

Fourth, the Elastic Stack is flexible, and so are we. Every component, Elasticsearch, Logstash and Kibana, adapts very well to many situations. The stack processes all kinds of input and lets you combine it with other data sources. For example, we extract human-readable names from a database, build a dictionary, then use the Logstash translate plugin to add this information to our events.
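
The enrichment step can be sketched in a few lines of Python. This only illustrates the idea of a dictionary lookup, which is what the Logstash translate plugin provides; the field names customer_id and customer_name, the sample dictionary and the function enrich are hypothetical and do not reflect our real setup.

```python
# Hypothetical dictionary, e.g. exported from a database: ID -> readable name
CUSTOMER_NAMES = {"1001": "Acme Corp", "1002": "Globex"}


def enrich(event: dict, dictionary: dict, fallback: str = "unknown") -> dict:
    """Return a copy of the event with a readable name added for its customer ID."""
    enriched = dict(event)  # leave the original event untouched
    key = str(event.get("customer_id", ""))
    # Use the fallback when the ID is missing from the dictionary
    enriched["customer_name"] = dictionary.get(key, fallback)
    return enriched


print(enrich({"customer_id": "1001", "message": "login failed"}, CUSTOMER_NAMES))
```

Doing the lookup at indexing time means that dashboards can show and filter on readable names without anyone having to query the database.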

Attack of the Buzzwords

Now, what happens when you have a team with the mission to identify and analyse incidents and the tool and the knowledge to spread this across the company? I'd like to show how this relates to Agile, Microservices, DevOps and Lean - sorry about this avalanche of buzzwords, but I promise that I have a reason to use them.
The Elastic Stack is a natural microservice. It is small, it serves one purpose, and it is deployable independently. We are so lucky. Enjoy it, use it, and build a team that can handle it: some members who are close to the content and its interpretation, like experienced RCA people, and one or two DevOps. This team is a cross-functional group of people that has everything, and everyone, necessary to get your Logstash analysis flying - and is an agile team by this definition.
There will be requests for more Elastic Stacks springing up all over the place. You have one in production, you need one for smoke tests. You have one for this application, you need one for another one, too. You have one in EMEA, you need one in the U.S. And so on.

This in turn means that your team will be setting up Elastic Stacks in all kinds of environments. If server setups differ, they will be feeling it. If the configuration is not the same, this will show in the dashboards. If automation isn't simple, they'll be slowed down. They say that you can recognize the vanguard by the arrows sticking out of their breast - let's have a look, yes, there's a number. But this is what a vanguard is for. We push for smoother deployment processes and automation, and we cross department borders. We aren't the only ones, of course, but we have a good chance to succeed because our service is so small. When we do, this paves the road for a more agile culture.

We've already dealt with most of those buzzwords, except for "Lean". This comes in when you look at the Build-Measure-Learn Circle:

The full presentation by Eric Ries can be viewed here.
Focus on the lower left: from "Measure" to "Learn". There tend to be gaps. The Elastic Stack and RCA cannot fill all of them, but they can help with some. The Elastic Stack is good for "Measure Faster", and RCA is good for "Learn Faster". And this is something worth working for.

Outlook

We are constantly working on our Elastic Stacks, and are presently implementing Shield and Watcher. Shield is important to enforce multitenancy while retaining all the data in one cluster to facilitate analysis across customers. Watcher will help us give special attention to issues, new workflows or jumpy customers. There are more adventures waiting for us with the Elastic Stack.

Sun-Tsung Kim started working with Elasticsearch at the end of 2012 at Autoscout24 and has been working with it ever since. She has a background in statistics, programming, OS and database administration, and various data stores, and likes meetups and Coursera. Additionally, Sun-Tsung is interested in new ways to work and organise.