
Overview

I’ve been dabbling a bit more with OpenStack as of late. If you know me, you can likely guess my goal: figuring out how to ingest logs, monitor resources, etc.

I’ve been trying to see how well Ceilometer, one of the core components of OpenStack that actually provides some of this stuff, would work. Initially, I was a bit bummed, but after fumbling around for a while, I am starting to see the light.

You see, the reason I almost abandoned the idea of using Ceilometer was due to the fact that some of the “meters” it provides are, well, a bit nonsensical (IMHO). For example, there’s network.outgoing.bytes, which is what you would expect… sort of. Turns out, this is a “cumulative” meter. In other words, this meter tells me the total number of bytes sent out a given Instance’s virtual NIC. Ceilometer has the following meter types:

Cumulative: a value that only ever grows over time (think odometer)

Gauge: a discrete or point-in-time reading (e.g.: number of instances, disk usage)

Delta: the change in a value from one sample to the next

Maybe I am naive, but it seems quite a bit more helpful to track this as a value over a given period… you know, so I can get a hint of how much bandwidth a given instance is using. In Ceilometer parlance, this would be a delta metric.
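
To put made-up numbers on it: if network.outgoing.bytes reads 1,200,000 at 12:00 and 1,500,000 at 12:10, the cumulative meter just keeps reporting the ever-growing total (1,500,000), while a delta over that 10-minute window would report 300,000 bytes – roughly 500 bytes/second – which is the number I actually care about.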

I’ll take an aside here and defend the fine folks working on Ceilometer on this one. Ceilometer was built initially to generate non-repudiable billing info. Technically, AFAIK, that is the project’s only goal – though it has morphed to gain things like an alarm framework.

So, now you can see why network.outgoing.bytes would be cumulative: so you can bill a customer for serving up torrents, etc.

Anyway, I can’t imagine that I’m the only person looking to get a Delta metric out of a Cumulative one, so I thought I’d document my way of getting there. Ultimately there might be a better way, YMMV, caveat emptor, covering my backside, yadda yadda.

Receiver: This is the system outside Ceilometer that will receive what the Publisher sends (Logstash for me, at least at the present – will likely move to StatsD + Graphite later on)

Store

While tangential to this post, I won’t leave you wondering about the “Store” part of the pipeline. Here are the storage options:
* Default: Embedded MongoDB
* Optional:
  * SQL
  * HBase
  * DB2

Honorable Mention: Ceilometer APIs

Like pretty much everything else in OpenStack, Ceilometer has a suite of open APIs that can also be used to fetch metering data. I initially considered this route, but in the interest of efficiency (read: laziness), I opted to use the Publisher rather than rolling my own code to call the APIs.

Working the Pipeline

There are two Transformers (at least that I see in the source):

Scaling

Rate of Change

In our case, we are interested in the latter, as it will give us the delta between two samples.

To change/use a given Transformer, we need to create a new pipeline via /etc/ceilometer/pipeline.yaml

The “cpu_pipeline” pipeline gives us a good example of what we will need (a sketch of its YAML follows the list below):

A name for the pipeline

The interval (in seconds) that we want the pipeline triggered

Which meters we are interested in (“*” is a wildcard for everything, but you can also have an explicit list for when you want the same transformer to act on multiple meters)

The name of the transformation we want to use (scaling|rate_of_change)

Some parameters to do our transformation:

Name: Optionally used if you want to override the metric’s original name

Unit: Like Name, can be used to override the original unit (useful for things like converting network.*.bytes from B(ytes) to MB or GB)

Type: If you want to override the default type (remember they are: (cumulative|gauge|delta))

Scale: A snippet of Python for when you want to scale the result in some way (would typically be used along with Unit)

Side note: This one seems to be required, as when I omitted it, I got the value of the cumulative metric. Please feel free to comment if I goobered something up there.
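
For reference, here is roughly what the stock cpu_pipeline entry looks like in /etc/ceilometer/pipeline.yaml – the exact layout shifts a bit between OpenStack releases, so treat this as a sketch rather than gospel:

-
    name: cpu_pipeline
    interval: 600
    meters:
        - "cpu"
    transformers:
        - name: "rate_of_change"
          parameters:
              target:
                  name: "cpu_util"
                  unit: "%"
                  type: "gauge"
                  scale: "100.0 / (10**9 * (resource_metadata.cpu_number or 1))"
    publishers:
        - rpc://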

Looking at all of this, we can see that the cpu_pipeline, er, pipeline:

Multiplies the number of vCPUs in the instance (resource_metadata.cpu_number) times 1 billion (10^9, or 10**9 in Python)

Note the “or 1”, which is a catch for when resource_metadata.cpu_number doesn’t exist

Divides 100 by the result

The end result is a value that tells us how taxed the Instance is from a CPU standpoint, expressed as a percentage.

Bringing it all Home

Armed with this knowledge, here is what I came up with to get a delta metric out of the network.*.bytes metrics:
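
Here is a sketch of what such a pipeline entry can look like – the pipeline name, the 60-second interval, and the UDP endpoint are placeholders of mine, not values to copy blindly:

-
    name: network_delta_pipeline
    interval: 60
    meters:
        - "network.incoming.bytes"
        - "network.outgoing.bytes"
    transformers:
        - name: "rate_of_change"
          parameters:
              target:
                  type: "gauge"
                  # per the side note above, omitting scale seemed to hand back
                  # the raw cumulative value, so a no-op scale is included
                  scale: "1.0"
    publishers:
        - udp://192.168.1.10:31337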

In this case, I’m taking the network.incoming.bytes and network.outgoing.bytes metrics and passing them through the “Rate of Change” transformer to spit out a gauge from what was previously a cumulative metric.

I could take it a step further (and likely will) and use the scale parameter to change it from bytes to KB. For now, I am playing with OpenStack in a VM on my laptop, so the amount of traffic is small. After all, the difference between 1.1 and 1.4 in a histogram panel in Kibana isn’t very interesting looking :)

Oh, I forgot… the Publisher. Remember how I said the UDP Publisher uses msgpack to stuff its data in? It just so happens that Logstash has both a UDP input and a msgpack codec. As a result, my Receiver is Logstash – at least for now. Again, it would make a lot more sense to ship this through StatsD and use Graphite to visualize the data. But, even then, I can still use Logstash’s StatsD output for that. Decisions, decisions :)

Since the data is in Logstash, that means I can use Kibana to make pretty charts with the data.

Here are the bits I added to my Logstash config to make this happen:

input {
  udp {
    port => 31337
    codec => msgpack
    type => ceilometer
  }
}

At that point, I get lovely input in ElasticSearch like the following:

Oh crap. Another post about logging…

Yep, get over it. This time, however, I’m not here to piss and moan about how much logging stinks. Rather, I want to show the power of extracting data from the logging you have.

Logstash + ElasticSearch + Kibana = Logging Bliss!

I’m not necessarily treading any new ground here, but for the uninitiated, let me quickly go over what is becoming a very popular logging technology stack in the DevOps space:

Logstash

ElasticSearch

Kibana

I will get into some of the deeper details later, but let’s take a quick tour of the components first.

Logstash

Logstash is the brainchild of Jordan Sissel. At a high level, the goal of Logstash is to chew up your logs and spit out contextualized data from what would otherwise be blobs of text. Said another way, instead of mountains of plain-text log files, you get the same log content split out and stuffed into fields you define. This allows you to do all sorts of things with the logs, not the least of which is to perform focused searches in Kibana for exactly the data you need.

Going with the lumberjack analogy, think of Logstash as your lumber mill that takes your logs and spits out 2x4s, etc. There’s a reason why you don’t see tree trunks at Home Depot. Tree trunks do me no good. Studs and plywood, now that I can work with.

ElasticSearch

In this stack, ElasticSearch operates as our super-fast “database” to hold what Logstash kicks out (technically, it’s a search index, but let’s not nerd out too much here). This description sells both ElasticSearch and Logstash short (Logstash can output in many different formats, not just ElasticSearch), but I’m trying to stick to the solution at hand.

ElasticSearch is based on Apache Lucene – just like Apache Solr – and lives to gobble up data broken into fields, then query large volumes of that data near instantly. It also gives a rich query syntax, which Kibana takes advantage of.

Kibana

Kibana is where we see the fruits of Logstash’s and ElasticSearch’s efforts. Kibana is a slick web UI that queries ElasticSearch for a couple of purposes: ad-hoc searching of your logs, and feeding the Panels that make up a dashboard.

As an example of the searching side, think of the last time you had to visit your Apache or Tomcat access logs to play Columbo for some issue. I am sure it was an exercise in grep, sed, awk, etc. to find that needle in the haystack. Even then, you end up with a lot of stuff that is likely not relevant (heaven forbid you need to look for common words like “post,” “get,” or “user”). Now you can say things like, “Show me all the log statements where”:

‘user’ shows up in the referring URL

The user got a 500 and the server took more than 500ms to fulfill the request

And here’s a cool benefit: Normal human beings can now have access to logs with a “type what you’re looking for” interface, so, I dunno, maybe they can leave you alone to do more interesting things?

What is interesting about Kibana is that there is no server-side component. Kibana is all client-side code. All you need is a web server to pass Kibana’s code along to the browser; the browser then issues queries directly to ElasticSearch via ElasticSearch’s APIs.

While Kibana has a built-in dashboard for Logstash that allows you to search your logs, where things really get interesting is applying Kibana’s Panels to display data from your logs. Think of Panels as the Lego from which you construct your dashboard. Behind each of these is an ElasticSearch query that populates that Panel.

While I will cover all of the Panels in my dashboard, here is a teaser:

Typically the best way to learn something is through an example. The primary reason I am writing this post is to help others take advantage of this tech stack, which I hope to do by walking through what I did – and shouldn’t have done – to get my Kibana dashboard in place.

Recall that the pipeline is like so:

Logging source (syslog/file/flume/etc) => Logstash => ElasticSearch

“Wait, where’s Kibana?” you might be asking. Remember that Kibana is just the UI to visualize data that Logstash, um, stashed in ElasticSearch. I’ll definitely cover Kibana in detail, but we need to start with Logstash, as that is where we add meaning to the text of our logging. Without that context, Kibana will be of limited use to us. Garbage in, garbage out – right?

Magic ain’t Free

Again, I am selling its full capabilities way short, but Logstash’s job here is to take a single line of text and – based on filters, either pre-defined or custom – assign chunks of that text to fields we define. We will use these fields later on not only to more intelligently query our logs, but to target specific data to be shown in Panels in Kibana.

Getting there boils down to four steps:

1. Be the logs

2. Know the end goal

3. Grok grok (no, that’s not a typo)

4. Iterate, because you ain’t gonna get it right the first time

This will take work, but to crib from Alton Brown, “Your hard work will be rewarded.”

Be the Logs

To really get the value out of this stack, you need to know your logs, as well as what you want to get out of them.

At this stage, we just want to look at our target logging and see what kind of data might be useful. So, do the obvious: copy a typical log statement into a text editor and look at it in isolation. Logstash doesn’t have the benefit of context. It’s going to chop up those logs one line at a time, hence I find it best to start my analysis by looking at just one log statement.

As mentioned in #4 above, you are likely to find that your logging is not as “normalized” as you think, so you’ll likely have to revisit this step quite a few times along your journey.
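
As a stand-in for the real thing, here is an illustrative Apache combined-format entry with a response-time value tacked onto the end (every value below is made up):

203.0.113.42 - - [15/Jan/2014:13:42:17 -0500] "GET /myapp/user/profile HTTP/1.1" 200 4523 "https://myapp.example.com/login" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36" 137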

There is the following useful/interesting information in that single log statement (in order shown):

Client IP address

Datestamp

Request URL

HTTP status code

Referring URL

User Agent string

Response time

Of those, are there any items that are throwaways (e.g.: data that would not be interesting/useful as search or graphing criteria)? In this case, I would argue not, but especially with logging in custom applications, you can end up with some items that might be interesting when viewed with the log statement itself, but might not be something that we would need to filter/chart (e.g.: PID).

That said, at this point, err on the side of capturing everything you can. If you ultimately determine you don’t need that data, you can simply bit-bucket it via your grok pattern and/or Logstash config.
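
For instance, a field you captured but later decided you do not care about can be dropped in the filter section of your Logstash config. The pid field below is purely a hypothetical example:

filter {
  mutate {
    # throw away a field we captured but do not need downstream
    remove_field => [ "pid" ]
  }
}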

Similar to knowing what meaningful data lives in your logs, it helps to know what Kibana can visualize for you. Here are the Panel types available (at least the non-deprecated ones) at the time I am writing this:

Bettermap: Uses Longitude and Latitude coordinates to place a “pin” on a map

Column: A “wrapper” panel of sorts; used to control the layout of panels in a dashboard

Derivequeries: This panel auto-creates facets for a given field (e.g.: might create multiple queries for each HTTP status code seen in the current data set)

Histogram: A chart showing the count, total, or mean of log statements (or of a numeric field within them) over the specified time range

Hits: Total count of lines matching a given query/set of queries

Terms: A “Top X” list of the most common terms found in log statements

Trends: Compares the data in the current view to the same data from an earlier, user-configurable time frame

Example: I am looking at 13:00-14:00 today and want to know – as a percentage – how many more/fewer 404s I have seen vs 13:00-14:00 yesterday

Know the End Goal

“Wait. Shouldn’t I start with this step?” Sure, you can definitely make that argument, but in many cases you have little control over what logging you get – which is often the case if you are not the one developing the code or controlling the deployment of the app.

As such, my philosophy is to see what I have to work with first. Then, I set my target for the final product.

YMMV on when you do this step, but you definitely need to perform it before you sit down to craft grok patterns.

Continuing with our Apache access log example, here are some potential target items we would want to craft grok patterns around – and why:

Search Filters:

All of these fields

Charting/Trending items:

Client IP address:

Top XX IP addresses

Geolocation of client

HTTP status code:

Histogram of unexpected statuses

User Agent String:

Client OS

Client browser

Response Time:

Histogram of mean response times (yes, I know that’s not statistically the best measure, but it’s what Kibana gives us. There are no confidence intervals, etc.)

Do you Grok Me?

Logstash comes with some built-in filters (called “grok patterns”). These cover common items like IP addresses, timestamps, and even common log types such as httpd access logs.

Using the Lego analogy again, most logs can be parsed using these out-of-the-box grok patterns. You can check out the full list in the Logstash GitHub.
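
As a taste of how little you have to write when a built-in pattern fits, a grok filter for stock Apache access logs can be as small as this (shown in the Logstash 1.x config style used elsewhere in this post):

filter {
  grok {
    # COMBINEDAPACHELOG is one of the patterns that ships with Logstash
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}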

As with most things, odds are your logging might not dovetail perfectly with these patterns and you will have to create your own. But, hey, that’s job security, right? ;)

Creating your own Grok patterns requires some heavy lifting with regex, on which grok patterns are based.

No! Don’t go! We’re just getting to the good part.

I know regex is a four-letter word for most, but in this case, we don’t have to write one massive regex for a given log statement. In fact, depending on your logging data, you likely won’t need to do any regex work at all, as Logstash comes with grok patterns for lots of things you commonly find in logs like:

Timestamps (including days, months, etc)

IP addresses

Class names (com.example.my.custom.class)

Quoted Strings (like requested and referring URLs in Apache logs)

Usernames

UUIDs

MAC addresses

In my case, I had to do some customizations because:

The built-in grok pattern for “program” – which is a sub-pattern referenced by the SYSLOGBASE pattern – expects a string containing the process ID (PID) like so: processname[PID] (e.g.: httpd[12345])

In the case of our Tomcat localhost_access logs, the program name is customized via our syslog config. As a result, the PID is missing (e.g.: accesslog[]) – see the pattern override sketched after this list

It took me a while to figure out, but because of this Logstash would not properly parse my localhost_access logs (see the Iterate section later on :) )

While not discussed extensively in this post, I also created a custom grok pattern to understand the logs our Tomcat application generates so I can get things like the userID the log statement pertains to, the Java Classname, etc to help perform more targeted searches
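
Here is a sketch of the kind of override that handles the empty-bracket case. SYSLOGPROG, PROG, and POSINT are stock pattern names; making the PID portion optional inside the brackets is my tweak, so double-check it against the patterns shipped with your Logstash version:

# in a custom patterns file
# allow "accesslog[]" as well as "httpd[12345]"
SYSLOGPROG %{PROG:program}(?:\[(?:%{POSINT:pid})?\])?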

Iterate

Like anything you write to parse strings, you’ll likely hit corner cases. Maybe a field is not populated in some instances, or what would normally be a string ends up being a number, or maybe some mobile device passes a funky User Agent string.

The point is, once you get a set of Grok patterns in place, you will want to review what Logstash gives you to ensure that you’re parsing as many logs as you can.

The reason why this is important is that if there is not a matching Grok pattern, the entire line of logging will not be split out. While it will still be searchable, you will not get the full benefit of this logging framework.

Going back to my missing PID as an example, this simple deviation caused every single line of my logging not to be parsed. Until I came across that missing PID as the root cause, I was unable to do all of the cool stuff you’ll see later when I discuss Kibana.

Grok Debugger

Unless you’re a regex whiz, you’re likely familiar with using sites like RegexPal to build a regex against a target string. Luckily, there is a similar tool for creating and testing Grok patterns: Grok Debugger.

The really cool thing about Grok Debugger is that it is Grok-aware and understands the built-in Grok patterns (e.g.: IPORHOST, URI, QS [Quoted String], etc). There is even a “discover” feature that will attempt to build a Grok pattern based on a sample line of logging you give it. While not 100% perfect, it does help make quick work of creating a custom Grok pattern, should the need arise.

Bringing it all Home

At this point I will assume you have done the following:

Deployed Logstash

Started passing logs through Logstash

Stored Logstash’s output in ElasticSearch

Used Kibana to query ElasticSearch, just to make sure things are working right

You can certainly get way more elaborate than this, but I’d recommend getting started with the simple version above. I’ve had such a setup ingest ~3GB of logs a day on a plain vanilla CentOS virtual machine with just 2 vCPU and 4GB of RAM – and it barely broke a sweat.
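
To put a config behind that recommendation, a bare-bones setup along these lines is enough to get going. The syslog port is arbitrary, and the embedded ElasticSearch flag reflects how Logstash 1.x did things, so adjust for your version:

input {
  syslog {
    type => syslog
    port => 5514
  }
}
output {
  # embedded ElasticSearch: fine for kicking the tires, not for production
  elasticsearch { embedded => true }
}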

IOW, crawl before you walk ;)

With that PSA out of the way, now comes the real work:

Creating and verifying your customized Logstash config

Creating and falling in love with your Kibana Dashboard

My Logstash Config

Note that I will not go into all of the ins and outs of Logstash config files, as there is plenty of info on that across the Intert00bz, but here are some sources to get you started:

While I certainly could have put the patterns directly in the Logstash config file itself, using a separate patterns file makes the config easier to read and manage (the wiring for that is sketched below).

The reason I had to create a pattern file was twofold:

I had to override the built-in SYSLOGPROG – and by extension, SYSLOGBASE – patterns because of our goofiness in not spitting out a PID along with the program name (e.g.: httpd:[])

Also, since the formatting of our Tomcat Access Logs deviated from the norm, I had to create a custom filter for that
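
Here is a sketch of the relevant plumbing. The patterns_dir option is a real grok setting, but the path and the TOMCAT_ACCESSLOG pattern name are placeholders standing in for whatever you define in your own patterns file:

filter {
  grok {
    # point grok at the directory holding the custom patterns file
    patterns_dir => "/etc/logstash/patterns"
    match => [ "message", "%{TOMCAT_ACCESSLOG}" ]
  }
}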

Um, WTF is the “geoip” and “coords” stuff in there?

I briefly mentioned that we can GeoLocate clients based on their IP address. This magic comes courtesy of Logstash, which ships with a built-in DB that maps IP addresses to physical locations – Longitude and Latitude coordinates, to be exact. These coordinates are based on data from the various Internet Registries (ARIN, RIPE, etc).

This leads to an issue for “internal” applications: What to do about private IP addresses? Obviously, those will not be in the various Internet Registries. Unfortunately, we can’t customize these, either. Logstash gets this Geolocation data from the kind folks at MaxMind. In order to keep this free (as in beer), the Geolocation DB is in MaxMind’s proprietary format. More on handling internal IPs coming up in a bit.

Back to the Geolocation DB: The proprietary, binary format means we can’t (not easily, at least) use a customized version of the DB to specify where private subnets actually physically reside within our organization. Of course, Logstash is Open Source, thus you could implement support for the CSV formatted DBs MaxMind provides, but that’s a topic for another post.

I will start off by saying that the vast majority of the Geolocation config was blatantly stolen from Bryan White’s post you can find here. The short version is that you have to do some conversion of the coordinates to get them into the format that Kibana needs for the Bettermap panel. The really short version is to just copy and paste what he has kindly provided and move on :)

Now, that just leaves the matter of how to handle internal IP addresses. Logstash recently added the idea of “conditionals” in its config files. This allows you to perform actions against a log statement only if certain conditions are met.

In this case, we want to skip Geolocation processing if either of these situations are present in a given log statement:

The sourceip field is missing

The value of the sourceip field falls into the range of internal IP addresses (e.g.: 10.x.x.x, 172.16.x.x through 172.31.x.x and 192.168.x.x)

The reason we want to skip Geolocation for these is the coordinates end up being 0,0 – which is somewhere West of Africa in the middle of the Atlantic. Not terribly useful data and it makes us wonder why so many people are visiting our site from the middle of the ocean ;)

As a result, I have the entire block of Bryan’s Geolocation config sitting inside this conditional:

if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip]

What follows the if is a regex, which is likely self-explanatory, but just in case it is translated to:

If the field sourceip:

Is not in the range of private IP addresses

and

Is present (which is why you see [sourceip] after the and near the end of the statement)

… then go ahead and do Geolocation for this log statement, which is done via the geoip filter and the output of geoip is reformatted by the mutate filters to get them into the format Kibana expects.

The reason for checking that sourceip is present in the first place is because I am receiving logs from all manner of locations (e.g.: messages logs, etc), many of which will not have an IP address in them. If I did not filter these out, I would get a lot of log statements with coordinates of 0,0. I learned that the hard way (see the “Iterate” section above).
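
Putting those pieces together, the shape of that part of the config is roughly the following. The geoip filter and the conditional are real; the mutate massaging of the coordinates is summarized as a comment, since the full version lives in Bryan’s post:

filter {
  if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip] {
    geoip {
      source => "sourceip"
    }
    # mutate filters go here to reshape the geoip output into the
    # [longitude, latitude] "coords" array that Kibana's Bettermap expects
  }
}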

The end result of this config is:

I am parsing Tomcat Access Logs and my application’s custom logs to get custom fields, some of which I defined (e.g.: userID) and some of which leverage built-in field names within Logstash (e.g.: user_agent)

All other logs (e.g.: messages, cron, etc) are simply stored as they are received with no special field creation, as I typically don’t have a need to search and/or chart on these

Note: Logstash still creates some fields, depending on the log type

For example, it will recognize cron logs and set the program field to “cron”

The output is going into the embedded ElasticSearch

Kibana Dashboard

At this point, I have laid the groundwork to get the fields necessary to populate my Kibana Dashboard. Again, this Dashboard is comprised of Panels, each of which will look for the presence of a given field for a document in ElasticSearch. Based on the contents of that field – and the type of panel – Kibana will visualize that data for us.

For example, the Bettermap panel will look for the value of the coords field in each record it receives. Then, based on that value, it will place a “pin” in the location corresponding to that value. A Terms panel can be configured to look at things like sourceip and give me a list of the top X (e.g.: 10) most common source IP addresses found in the logs, along with how many times that IP address appeared in the logs for a given time frame.

Before I get to the shiny, let me detail each of the panels in my Dashboard so you can better understand the resulting charts, etc.

“Events Over Time”: This is a Histogram panel that gives me the total number of log statements (stored as “documents” in ElasticSearch). Depending on the time range chosen, Kibana will aggregate these. For example, if you look at an hour’s worth of logs, it will aggregate the total number of documents by chunks of 30 seconds, then graph that aggregated count. Note: This is a raw count of all logging seen for the given time frame, regardless of log source.

“Mean Response Time”: This is the mean value of the response time, as found in only the Tomcat Access Logs. When you configure a Panel to look at the target field, that Panel will formulate its query appropriately. For example, the “Mean Response Time” Panel is only concerned about documents in ElasticSearch where the resp_time field (which is one I custom-defined in my patterns file) is present. The result is that the Panel issues a query like this to ElasticSearch to get the data it will graph:
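
A hand-written approximation of that query is below. It is not a capture of what Kibana actually sends, but it shows the two parts that matter: restricting to documents where resp_time exists, and a date_histogram facet that yields a mean per time bucket (ElasticSearch 0.90-era facet syntax):

{
  "query": {
    "filtered": {
      "query": { "query_string": { "query": "*" } },
      "filter": { "exists": { "field": "resp_time" } }
    }
  },
  "facets": {
    "mean_resp_time": {
      "date_histogram": {
        "key_field": "@timestamp",
        "value_field": "resp_time",
        "interval": "30s"
      }
    }
  },
  "size": 0
}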

“Most Frequent IPs”: As the title implies, this is the Top X (25, in this case) most frequently seen IP addresses, based on the sourceip field. As is the case with many of the fields in this Dashboard, this field is only present in Tomcat Access Logs.

“HTTP Response Codes”: This is a histogram charting the (hopefully low) total number of HTTP response codes that are > 400, with the idea being to see how many “errors” are returned to the users.

“Trends”: Using the Trends Panel, this compares the number of >300 Response Codes seen in this time frame to the same time frame in a previous period. In my case, it is 7 days ago, as my default view shows the previous hour. I chose 7 days to avoid situations where I would compare a weekday to a day on the weekend, as usage is mainly during the work week in my case.

Note that the Trends Panel is only as valuable as the data you feed it. If you have a small number of hits for a given term, you can see massive jumps in percentage compared to your trend target.

As you can see from the example below, I need to re-evaluate the usefulness of this panel. But, I did say to iterate, right? :)

“Browsers”: This is a pie chart breaking down the ratio of various client browsers, based on the User Agent string.

“Operating Systems”: Another pie chart to show the breakdown of client OS’, again based on the User Agent string.

“HTTP Errors”: A pie chart showing a breakdown of the same data shown in the HTTP Response Codes Histogram.

“Server Activity”: This Terms Panel shows the relative amount of logging from each of the node types in the distributed application I am monitoring (e.g.: Tomcat, memcached, MongoDB, etc)

“Client Locations”: A Bettermap Panel that uses the Geolocation coords field to show where client requests are physically coming from

“Logs”: A Table Panel that shows the @timestamp, message, and sourceip fields. Clicking any row in this table expands to show all of the fields related to this log statement. This is part of the canned Logstash dashboard that ships with Kibana.

Enough, already. Just show me the whole Dashboard!

OK, OK. Geez.

The final result is below. It is admittedly a bit busy, but I’ve built it to be as much a tech demo as a useful tool.

I hope this post helps you create a useful dashboard of your own!

Update: Adding Kibana Template

I’ve had a few folks ask for the Kibana template used for the dashboards above. So, here it is! Hope you find it useful.