This article is an introduction to pre-processing logs from multiple sources in logstash before storing them in a data store or analyzing them in real time. Common use cases are unifying time formats across different log sources, anonymizing data, extracting only the interesting information from the logs, as well as tagging and selective distribution.

The first two parts of the series “Scalable and Robust Logging for Web Applications” described how to improve the default Ruby on Rails logger with Log4r and how to transport logs to a central location. This is the third post in a series whose goal is to develop a robust system for logging, monitoring and collecting metrics – one that can easily scale in terms of throughput, e.g. by adding more application servers, is easy to expand to new types of log data, e.g. by adding a new database or importing external data sources, and makes it easy to modify and analyze the data exactly as you need.

(Subscribe via Email to get informed as soon as the next post in this series is published! Sign up now on top of the sidebar to your right!)

Why Pre-Process Logs?

Why not just pipe everything into a big HDFS cluster and run MapReduce jobs on it? There are a few reasons why pre-processing is essential to make the actual data analysis easier and faster; one of the biggest is data cleanup and unification. Not only is it faster to process the log data once and then store it in a unified format, it is also more secure, as there is only one codebase that needs to be tested instead of every single query that performs standard data manipulation.

[Mon Aug 23 15:25:35 2010]                                         # Apache
110208 12:12:06                                                    # MySQL
2013-04-04T14:13:34+00:00                                          # Heroku
Started POST "/users" for 127.0.0.1 at 2013-11-27 14:14:53 -0800   # Rails

A big problem of heterogeneous logs is the different format of their timestamps. Even worse, some formats omit the timezone or even the year. The timestamps also appear at different positions within the log line, which makes sorting, selecting and comparing by time much harder. Just counting the logs between last Monday and Friday requires a different parser for each of the formats in the examples above.
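To make the problem concrete, here is a minimal Ruby sketch (illustrative, not logstash code) that normalizes the example formats into UTC ISO 8601 timestamps; the strptime patterns are assumptions matched to the samples above:

```ruby
require 'time'

# One strptime pattern per log source; patterns are assumptions for the examples above.
PATTERNS = {
  apache: '%a %b %d %H:%M:%S %Y',  # Mon Aug 23 15:25:35 2010 (no timezone!)
  mysql:  '%y%m%d %H:%M:%S',       # 110208 12:12:06 (no timezone, two-digit year)
  heroku: '%Y-%m-%dT%H:%M:%S%z',   # 2013-04-04T14:13:34+00:00
  rails:  '%Y-%m-%d %H:%M:%S %z'   # 2013-11-27 14:14:53 -0800
}

# Parse a raw timestamp from a known source and emit a unified UTC ISO 8601 string.
def normalize(source, raw)
  Time.strptime(raw, PATTERNS.fetch(source)).utc.iso8601
end
```

Note that the Apache and MySQL formats carry no timezone at all, so normalizing them correctly requires knowing the server's timezone out-of-band – exactly the kind of knowledge best applied once, during pre-processing.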

Standard Rails logs show another big issue with certain kinds of logs. While we got rid of the verbose Rails log using Log4r, other programs might not allow for this kind of customization of the logging output. Finding information in logs where each log event stretches across multiple lines is much harder than in well-formatted single-line logs. Furthermore, logs might contain sensitive information that should be kept out of the stored logs, like passwords and user-identifiable data.
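The scrubbing idea itself is simple; a minimal Ruby sketch (illustrative only – logstash provides mutate and anonymize filters for this, shown later) that blanks out sensitive keys in a Rails-style parameter line might look like this:

```ruby
# Keys to scrub; this list is an assumption for illustration.
SENSITIVE_KEYS = %w[password password_confirmation token]

# Replace the value of each sensitive `"key"=>"value"` pair with a marker.
def scrub(line)
  SENSITIVE_KEYS.reduce(line) do |scrubbed, key|
    scrubbed.gsub(/"#{key}"=>"[^"]*"/, %Q("#{key}"=>"[FILTERED]"))
  end
end
```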

Logstash has been built to solve exactly these and many more problems with ease:

Introducing Logstash

Logstash is a JVM-based tool (written in JRuby) for pre-processing logs. It has four basic phases – input, decode, filter and output – in which the logs can be annotated, trimmed, unified and modified in many other ways through corresponding plugins. Logstash already comes with a very comprehensive set of default plugins, and extending it is very simple due to its modular structure. Another outstanding feature is grok, a “write once, combine everywhere” approach to regexes, which also has a great online debugger to help with writing patterns.

Workflow

Input

There are multiple ways to pre-process logs, depending on the use-cases and the available resources. The most common setups are:

1. Logstash processes the logs on each server and sends the results directly to storage.
+ Less data to transport

2. Each server sends its logs directly to a central logstash instance for processing.
+ Central place to make configuration changes

3. Each server sends its logs to a log aggregator or pub/sub system. Logstash subscribes to it, processes the logs and writes the results to a data store or back to the pub/sub system.
+ Scalability
+ Allows data to be accessed by multiple systems

4. Each server sends its logs to a storage system, like Hadoop. A subset of the logs is sent to logstash for processing and distribution.
+ Good for very high throughput and use-cases where not all logs need pre-processing or are used for analytics

This shows the versatility of logstash and how it can be used at many different stages of log processing. This series will focus on option three.

Decoding

Logs can come in many different forms and shapes. Not only do logs have different patterns to store their data, some might even come already in a structured form like JSON. In this step structured data is extracted into variables that can later be manipulated and stored.

Filter

In this part the data in the variables can be modified, combined and parsed. Common use-cases are unifying date formats and converting timezones, tagging logs based on source or content, anonymizing data, creating checksums, extracting or converting numbers, decoding JSON or XML data as well as creating simple metrics.
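Unifying timestamps, for instance, is typically done with the date filter. A hedged sketch for the Apache timestamp format from the beginning of the article – the field name and pattern are assumptions:

```
filter {
  date {
    # Parse an Apache-style timestamp out of a (hypothetical) "timestamp" field
    # and use it as the event's canonical @timestamp.
    match => ["timestamp", "EEE MMM dd HH:mm:ss yyyy"]
    # The Apache format carries no timezone, so it has to be supplied here.
    timezone => "UTC"
  }
}
```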

Output

After extracting, unifying and manipulating the data, logstash supports a vast number of outputs, like databases, TCP and UDP, as well as IRC or PagerDuty. And if the necessary output does not ship with logstash, the community is very active and has most likely already built an unofficial one. Worst case, outputs are just Ruby code, so creating a custom output is very easy.

This series will demonstrate reading input from Kafka, decoding and pre-processing different types of logs, and then outputting the results back to Kafka as well as to Storm, to allow for real-time analysis, to ElasticSearch, to allow querying and visualization via Kibana, and to Hadoop for long-term storage and large-scale analytics.

Getting Started with Logstash

Setting Up Logstash

The config file is written in a simple JSON-like syntax and contains three top-level sections, input, filter and output. Next is a very basic example that reads from stdin and prints to stdout in the rubydebug format:

input { stdin { } }
output { stdout { codec => rubydebug } }

# This example is from logstash's Getting Started Guide:
# http://logstash.net/docs/1.4.0/tutorials/getting-started-simple

After unpacking and configuring logstash, the following command starts it:

bin/logstash -f logstash.conf

You can now type something like Hello World and press Enter. Logstash should return a JSON Hash/Map with your message and some other useful information like the timestamp and the host:

Hello World!
{
    "@timestamp" => "2014-04-21T05:17:57.641Z",
      "@version" => "1",
          "host" => "Markuss-MacBook-Pro.local",
       "message" => "Hello World!"
}

In the remainder of this section we will set up logstash to consume data from Kafka and process it in multiple ways.

* at the time of this writing the latest version was 1.4.0

Consume From Kafka 0.8

In order to get data from Kafka, a third-party plugin has to be used. The only one that supports Kafka 0.8 at the moment is logstash-kafka, which is based on jruby-kafka. The plugin requires logstash to be rebuilt with it included, but the author provides a makefile that automates this step:

$ make flatjar

# specify the logstash version
$ make flatjar LOGSTASH_VERSION=1.4.0

# you can also specify different scala or kafka versions using SCALA_VERSION or KAFKA_VERSION respectively

Building the plugin, however, requires some dependencies like JRuby, Scala, Kafka and jruby-kafka to be available. Furthermore, the plugin does not have any tests. If this seems too much of a risk, an alternative solution is to write a small Java program that uses the default consumer that comes with Kafka and sends the data to logstash via TCP/UDP. Following is a sample logstash.conf where logstash-kafka is used to read data from Kafka 0.8.

input {
  kafka {
    zk_connect => "kafka1.mmlac.com:2181"
    group_id => "logs"
    topic_id => "logstash"
    reset_beginning => false
    consumer_threads => 1
    consumer_restart_on_error => true
    consumer_restart_sleep_ms => 100
    decorate_events => true
  }
}

Parse Logs

As soon as the Kafka messages come in, they have to be converted into key-value pairs. If the messages are in a supported format, they can be converted automatically by using a codec. Supported codecs include JSON, Graphite and multiline – which allows collapsing multi-line events like stack traces into one message – among others.

decoding JSON that is delimited with \n

input {
  file {
    path => "/var/logs/webserver-json.log"
    codec => json_lines {
      charset => ... # charset name, default: "UTF-8"
    }
  }
}

If logs are not in one of these formats, the grok filter is a great way to split up the logs into key-value pairs based on regexes.
break_on_match => true makes grok stop processing the current filter as soon as a matching regex is found. This helps performance, as matching regexes is a pretty expensive operation. Without this flag, every regex in the filter is executed on each line every time, even if a previous one already matched.
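A hedged grok sketch showing break_on_match with two alternative patterns for the same field (the choice of patterns is illustrative):

```
filter {
  grok {
    # Try the combined Apache format first; stop at the first pattern that matches.
    match => ["message", "%{COMBINEDAPACHELOG}", "%{COMMONAPACHELOG}"]
    break_on_match => true
  }
}
```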

Access Data in Logstash

After the input has been parsed, data can be accessed using square brackets. Assume the parsed data contains a field headers, a map of the fields that were sent to the server in the request header. To access the referrer field that was sent in a request, the syntax is [headers][referrer]. Or take a look at another example on the logstash website.
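Nested fields can also be used in conditionals. A small sketch using the hypothetical headers field from above (the regex and tag name are illustrative):

```
filter {
  # Tag requests whose referrer header points at a search engine.
  if [headers][referrer] =~ "google" {
    mutate { add_tag => ["search_traffic"] }
  }
}
```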

When outputting data it is sometimes necessary to name an output based on the value of a field. To send a message to statsd denoting which response code a request had, for example, the following code can be used; it will send an increment notice for apache.500 when the request that is currently being processed resulted in a 500 Internal Server Error response code.

output {
  statsd {
    increment => "apache.%{[response][status]}"
  }
}

Adding Tags and Using Conditions

Logstash allows tagging certain types of input as well as applying filters only when certain conditions are met.

The first way to add tags as well as types is during the input phase. Types are in essence the same as tags; however, there is only one type per input, and types are mainly used for filter activation.

input {
  file {
    path => "/var/logs/webserver.log"
    tags => ["web"]
    type => "log"
  }
  file {
    path => "/var/logs/loadbalancer-error.log"
    tags => ["loadbal"]
    type => "error"
  }
}

Logstash also supports the usual conditions if, else if as well as else and can use tags, types and any fields to selectively apply filters.

logstash.conf

filter {
  # matching string with regular expression
  if [path] =~ "/login" {
    mutate { add_tag => ["login"] }
  }
  # filter by type
  if [type] == "error" {
    mutate { remove => "password" }
  }
}

output {
  # is "metric" in the tags array?
  if "metric" in [tags] {
    statsd { ... }
  } else {
    file {
      path => "../output-logstash.log"
      codec => "rubydebug"
    }
  }
}

Anonymize Data from Logs

Some jurisdictions, especially in the EU, require logs to be anonymized to a certain degree, e.g. the raw IP address cannot be stored without consent. Logstash makes this easy by providing an anonymize filter. It takes the fields that should be anonymized as input and hashes them. It also has a special mode for IPv4 addresses that truncates them based on a subnet prefix length.

filter {
  anonymize {
    algorithm => "SHA1"
    fields => ["address", "firstname", "lastname", "secretData"]
    key => "ThisIsMyHashingKey"
  }
  anonymize {
    algorithm => "IPV4_NETWORK"
    fields => ["ip_address"]
    key => "16" # subnet prefix length, 16: 255.255.255.255 -> 255.255.0.0
  }
}

Create Simple Metrics

Logstash also allows creating simple metrics from fields. It supports two modes, called meter and timer. Meter counts the occurrences of a field and outputs sliding-window rates (events per second) over the last 1, 5 and 15 minutes. Timer additionally computes averages as well as percentiles over the value of a field.

Metrics are flushed separately every 5 seconds, or whatever value is set using flush_interval. The percentiles and counts are reset based on the value of clear_interval, which should always be a multiple of the flush interval to prevent inaccuracies.

logstash.conf

filter {
  metrics {
    meter => ["http.%{response}"]
    timer => ["render_time", "%{time}"]
  }
}

Result

# Inputs in json_lines format
{"response":500,"time":123}
{"response":333,"time":500}

# Output after feeding some of these events to logstash
{
    "http.333.count" => 6,
    "http.333.rate_15m" => 1.1867404671527066,
    "http.333.rate_1m" => 1.015778069868737,
    "http.333.rate_5m" => 1.160659320578407,
    "http.500.count" => 8,
    "http.500.rate_15m" => 1.1867404671527066,
    "http.500.rate_1m" => 1.015778069868737,
    "http.500.rate_5m" => 1.160659320578407,
    "render_time.count" => 14,
    "render_time.max" => 500.0,
    "render_time.mean" => 311.5,
    "render_time.min" => 123.0,
    "render_time.p1" => 123.0,
    "render_time.p10" => 123.0,
    "render_time.p100" => 500.0,
    "render_time.p5" => 123.0,
    "render_time.p90" => 123.0,
    "render_time.p95" => 123.0,
    "render_time.p99" => 123.0,
    "render_time.rate_15m" => 2.3734809343054133,
    "render_time.rate_1m" => 2.031556139737474,
    "render_time.rate_5m" => 2.321318641156814,
    "render_time.stddev" => 14.031458544495445
}

There are also services built for the sole purpose of handling stats and helping to analyze them, for example statsd and graphite. Sending data to these services can be done by outputting only specific fields, as shown in this example for Apache logs:

filter {
  grok {
    type => "apache-access"
    pattern => "%{COMBINEDAPACHELOG}"
  }
}

output {
  statsd {
    # Count one hit for every event, by response code
    increment => "apache.response.%{response}"
    # Use the 'bytes' field from the apache log as the count value.
    count => ["apache.bytes", "%{bytes}"]
  }
}

Save Output to Multiple Destinations

After the data is pre-processed, it needs to be transported to the next stage, be it final storage, indexing, real-time processing or other applications.

Logstash does this in its outputs and offers a wide variety of standard outputs as well as many plugins to extend this selection. This article will demonstrate pushing the pre-processed data to ElasticSearch, Hadoop, Kafka and Storm. Outputs can also filter which data they process by checking tags and attributes, as described above in Adding Tags and Using Conditions.

ElasticSearch

Save all messages that have the tag index to be indexed by ElasticSearch:

output {
  if "index" in [tags] {
    elasticsearch {
      action => "index"
      bind_host => "localhost"
      bind_port => 9300
      cluster => "logstash-cluster"
      codec => "plain"
      embedded => false
      flush_size => 5000
      idle_flush_time => 1
      index => "logstash-%{+YYYY.MM.dd}"
      protocol => "http" # if you run anything older than ElasticSearch 1.0.1
    }
  }
}

This is just a very basic example; the latest versions of logstash and ElasticSearch add more features, like templates. The parameters also have to match how the indexes in ElasticSearch are set up. Consult the full documentation of the ElasticSearch output before sending data to an existing cluster.

Apache Kafka

Kafka is amazing at distributing messages to multiple consumers, which also makes it a great destination for the pre-processed data. Writing back to Kafka uses the same plugin as mentioned above, logstash-kafka.

output {
  kafka {
    broker_list => "kafka1.mmlac.com:9092"
    topic_id => "logstash"
    compression_codec => "snappy"
    request_required_acks => 1
  }
}

Apache Storm

There is no predefined way to get data directly into Storm. Two approaches are using Kafka as an intermediary queue together with a Kafka consumer spout – spout being the name for a data source in Storm – or a TCP spout that receives data directly from logstash. For real-time data analysis, Kafka introduces unnecessary latency but can help with scaling.
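The TCP approach could be sketched with logstash's tcp output pushing JSON lines to the machine where the Storm TCP spout listens; the host and port below are placeholders:

```
output {
  tcp {
    host => "storm1.mmlac.com"   # hypothetical spout host
    port => 9999                 # hypothetical spout port
    mode => "client"
    codec => json_lines
  }
}
```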

Using ElasticSearch

ElasticSearch is logstash’s favorite storage/indexing engine, and there are a few nice plugins that work very well with this combination and make it easier to use.

A small web client is shipped with logstash and can be started by adding web to the logstash command, like
bin/logstash agent -f logstash.conf web, which will launch the agent and the web client at the same time. This is a quick way to get started, especially when using the embedded ElasticSearch via
output { elasticsearch { embedded => true } }.

For more sophisticated applications, elasticsearch-kopf and Kibana are tools to know. Kopf is a tool more focused on administrating the ElasticSearch cluster whereas Kibana is optimized for data analysis and displaying the gained insights.

elasticsearch-kopf REST query interface

Kibana displaying data from ElasticSearch

Using Grok

Writing grok patterns is sometimes as complicated as writing plain old regular expressions. Luckily, there are amazing online tools available for both.

Grok Debugger helps with writing grok patterns and also has a nice list of common patterns. For full-blown regular expressions in Ruby, Rubular is essential. It has a small cheat-sheet at the bottom and allows evaluating full regexes against test data, immediately showing matches and captures.
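Under the hood, a grok pattern expands into an ordinary Ruby regular expression with named captures. A simplified, illustrative sketch – the patterns below are loose stand-ins, far less strict than grok's real IP and number patterns:

```ruby
# Simplified stand-ins for grok-style IP and status-code patterns.
IPV4   = /(?<clientip>\d{1,3}(?:\.\d{1,3}){3})/
STATUS = /(?<response>\d{3})/

# A named-capture match behaves like grok's field extraction:
# each named group becomes a key-value pair on the event.
line = '127.0.0.1 - - "POST /users" 200'
m = /#{IPV4}.*\s#{STATUS}\z/.match(line)
```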

Logstash is extendable via JRuby, and the community is very supportive of beginners. If a tool, plugin or feature is missing, mention it on the mailing list or just code away and create a solution yourself.

I hope this article gave you a good introduction to using logstash to pre-process your logs. The next posts in this series will take a look at how to process these logs in real time using Storm.

What do you think? Do you have any comments, improvements, questions or just enjoyed the read? Go write a comment below, at reddit or HackerNews and I try to respond as quickly as possible :)

Help others discover this article as well: Share it with your connections and upvote it on reddit and HackerNews!

Subscribe to the blog to get informed immediately when the next post in this series is published!

13 thoughts on “How to Pre-Process Logs with Logstash: Part III of “Scalable and Robust Logging for Web Applications””

Just an update for your article, the kafka-logstash plugin does have tests now and has been in production use for nearly a year at roomkey.com shuffling thousands of logs per second as needed. Further, the plugin is being introduced into the mainline of logstash beginning with version 1.5.

Thanks for the interest in my plugin and the great writeup as to getting stuff going. I’m sure people find it informative.

3. I also need to calculate metrics and display them in graph, store the metric datapoints in database.

Any article i have seen online, they have suggested to use Redis or equivalent like ZeroMQ or RabbitMQ) to act as a buffer. Firstly do you really require this redis in the pipeline. Can’t we directly index it from the client itself (or push it to Amazon S3 or Mongo or store it in Database). The reason what they have mentioned in the groups is that Indexing is CPU intensive, you need a central place where you configure what you want to do with your log file i.e outputs

If you are ok with spending a few CPU cycles and also if you have a configuration file and you want to change what you want to do with output, you can come up with a script which will either download the file from a central location each time it runs or read the config file from a GFS location. So what i am not able to get here is what is the real purpose redis is serving here.

Secondly i am concerned with redis scalability. In the docs they have spoken about high-availability – failover mechanism of redis. This is fine. But dont you think redis will become a bottleneck. Is there anything like you can have a cluster. ( I am not ok with having different master redis server for a set of clients to scale out – difficult to maintain). Can you do clustering?

Thirdly i want to do monitoring – DB writes/sec, DB reads/sec, number of messages in the queue, what is the cpu% in machine1, machine2 and so on. I want to store all this data in a DB like mysql or mongo and also want to display them in graph. Is it possible or should i have to use something like nagios or sensu.

ELK can help you centralize logs, Kibana will help you to view the log data and search for it using Elastic Search as the data repository. Logstash has a number of plugins to allow you to gather metrics and ship them to where you like for a special view or special processing.

The broker [redis, RabbitMQ, zeroMQ, etc] isn’t required but recommended if you want to be able to scale or handle a significant amount of data. There are plugins to allow you to use AWS SQS.

The purpose of the broker is to act as a buffer for the data going into logstash. With high volumes of data, without a broker of some kind, you may have contention for the indexing versus the incoming IO. This could create a condition where messages could be lost. Using a broker will allow the data to hang around for a time while you fix the problem or wait until the indexer is available to grab more data.

Yes, you could certainly write a script to download the file to process it. But then we are defeating the purpose and reason for setting up ELK in the first place. It will do all that for you.

Yes, normally in production we would set up the broker in a cluster for both failover and high-availability. Or technically, AWS SQS would allow Amazon to handle cluster issues in the background and you could just read and write to it through the plugin.

While we could use ELK to do monitoring, monitoring is usually done through log files. Probably not the best use case. However, you can certainly use logstash-forwarder to forward any data you give to it to statsd, nagios or wherever you want.


About the Author

Markus Lachinger is an entrepreneur, product manager and full-stack software engineer with a Masters in Software Management from Carnegie Mellon University. He is based in Mountain View, CA and blogs mainly about Ruby, Scala and Scalable Infrastructure. Find out more