December 12, 2016

Logstash by Elastic Co is more than simply a platform for stashing logs;
it’s a data-transformation and shipping pipeline that’s also very well suited for
log data. At my company, we use Logstash in at least one data-transformation
role for the main applications. In other companies, Logstash isn’t involved in
log-aggregation at all and is only used for its data transformation capabilities.

Essentially, Logstash is a specialized form of ETL pipeline. It extracts
log data from a variety of sources, transforms it using one of the many filters
available, and loads it into data-stores (it’s built to dump into
ElasticSearch, but that’s not required) or API endpoints. Quite handy. Here are
some of the things you can do with it:

Gather and collate metrics from the event-stream, dumping them into a time-series database.

Monitor twitter for keywords and send alerts to your marketing team.

Open JIRA tickets based on events discovered in the stream.

Logstash can scale from an all-in-one box, useful for testing out your parsing logic,
to massive distributed parsing architectures supporting the data load of thousands
of nodes. It can scale with you as you grow, so long as you pay attention to the
distributed nature of some of what it does. We will get into that.

To understand how it works, first we need to talk about its architecture.

The Pipeline

The Logstash processing pipeline has three stages: Input, Filter, and Output.
Inputs and outputs have codecs that can translate data into or out
of common formats such as JSON.
Inputs define where you are getting your event data. Each item in the pipeline
is called an event. This is almost always a line of text, accompanied by a variety
of metadata depending on the plugin used. Events may be polled by Logstash, or
pushed to Logstash.

Filters define how you transform and enrich the events in the pipeline.

Each input { } block in the config-file is given its own thread. If you have
multiple high-throughput files, or different input types in your config, you may
get better performance on multi-core systems by defining each of those in a separate
input { } block.

Don’t go nuts with this; be intentional about it. For the file plugin, if you’re
dealing with fewer than 200 log-lines a second you will not gain much benefit
from splitting. If your application creates very active logs, and you’re also
monitoring syslog, you may want to put the application-logs in one input {} and
syslog-files in another to ensure syslog rarely gets backlogged behind the application-logs.
On machines with low core-counts, premature optimization can hurt.

Inputs queue up micro-batches, which are picked up by the filter stage. You can
configure how many workers process the filter queue through the -w or --pipeline.workers
command-line parameter in Logstash. By default, it’s set to 8.

The worker threads take the micro-batches off of the queue and run them through
the filters. The same thread then runs the transformed events through the
outputs. On a multi-core system, this leads to highly parallel processing.

Of note, the threads do not share state between each other. Certain plugins,
such as the metrics plugin, keep some state internal to the thread.
When it flushes its events, a single logstash node will flush, by default, 8 separate
metrics events. In the case of metrics, it is very wise to use something like
statsd to further aggregate the metrics coming out of logstash. The
metrics plugin is the biggest trap for the unwary, as people like their numbers
and the squiggly lines on graphs that use them.
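
For instance, here is a sketch of pairing the metrics filter with statsd to aggregate those per-thread numbers (the hostname and metric name are illustrative):

output {
  if "metric" in [tags] {
    statsd {
      host => "statsd.prod.internal"
      # The metrics filter emits events with fields like [events][rate_1m].
      gauge => { "logstash.events_rate" => "%{[events][rate_1m]}" }
    }
  }
}

The matching filter side would be a metrics { meter => [ "events" ] add_tag => [ "metric" ] } block; statsd then sums the flushes from each worker thread.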

The state issue matters if you have a need to correlate event flows, such as marking
a start, middle, and end of an event contained on multiple log-lines. This needs to
be done at the input, likely with the multiline codec, or in a separate ETL process
outside of Logstash entirely.

Warning: If you are on Logstash 2.1 or older, you’re running a less efficient
pipeline model, which had another queue between the filter stage and the
output stage. Under that model, outputs were single-threaded, and events moved
through the pipeline more slowly. On Logstash 1.5, our 4-core main parsing nodes
topped out at 47% CPU no matter how we adjusted worker-count or JVM Heap. When we
upgraded to Logstash 2.2 we changed nothing about worker-counts or Heap, and the CPU
high-water mark reached 75%. Event throughput rose proportionally, all
from that redesigned pipeline.

The Config-file

The config-file for logstash is broken up into three sections to match the pipeline
model. There are two other files for managing JVM settings and startup options,
but those will not be covered by this article.

input {
  # Input plugin declarations go here.
  # There can be more than one plugin per block.
}
input {
  # There can be multiple input blocks, too. Threading, as discussed.
}
filter {
  # Filter plugin declarations go here.
}
filter {
  # Like inputs, you can have multiple filter blocks.
  # In fact, you almost definitely will have more than one.
}
output {
  # Output plugin declarations go here.
  # Yes, like the other two, there can be multiple output blocks,
  # and multiple output plugin declarations per block.
}

Config files are kept at /etc/logstash/conf.d/*, a location which can
be overridden by command-line parameters. It isn’t clear in what order Logstash
loads those files, so it’s still a good idea to make a single big one to be sure
order is preserved. input {} blocks don’t care much about order, as they all run
in their own threads. Filter blocks are handled by multiple threads, and order is
important there.

When an event is created by an input plugin, a message field is created with the
contents of the text received. The event will also have metadata fields attached
to it that depend on the individual plugins. Use of a codec, such as JSON, will
cause fields contained in the incoming text to be applied to the event. Once this
metadata tagging is done the event is handed off to the filter workers in micro-batches.

Filter workers then take the marked-up events and process them. This can be anything
from adding fields, using the grok filter to parse fields into more fields, dropping
fields, casting fields to a data type, and many more transforms. For complex filter
chains, a field may exist only briefly; created and thrown away as filtering
proceeds.
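
For example, here is a sketch of two ordered filter blocks (the conditional, patterns, and the way auth_message is derived are illustrative):

filter {
  if [type] == "auth" {
    # Mark the event as an authentication event.
    mutate {
      add_field => { "sub_type" => "authentication" }
    }
    # Pull an auth_message field out of the message field.
    grok {
      match => { "message" => "%{GREEDYDATA:auth_message}" }
    }
  }
}
filter {
  if [sub_type] == "authentication" {
    # Work on the auth_message field created above.
    mutate {
      lowercase => [ "auth_message" ]
    }
  }
}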

The first filter block creates a field called sub_type and sets its value to authentication.
It also decodes the message field and creates a new field called auth_message.

The second filter block checks to see if the sub_type field is set to authentication
and then does some work on the auth_message field that should be there. Due to
ordering, if this second block appeared before the first it would never execute.

Now that we’ve discussed order and the overall file, it’s time to go into each section.

Inputs!

As of very recently, there were over 50 available input plugins.
That’s quite a lot, so I’m going to focus on the plugins that sysadmins would be
most interested in for gathering log and telemetry data:

file, because we work with log files.

eventlog, because Windows needs stashing too.

beats, for reasons you’ll see in a moment.

There are others that are very interesting, but would make this long read even longer:

tcp and udp, to send text at a port directly and do away with your application logfiles entirely.

rabbitmq, zeromq, and kafka, receive events off of a message broker.

irc, hipchat, and xmpp, to archive chat messages for compliance reasons, allow analytics on support-channel traffic, or use as a message-passing bus.

jdbc, for pulling events out of an RDBMS.

File

The file input is very straightforward. You give it a path or
an array of paths to monitor. That’s it. You can enhance the input’s performance
with additional options. Here is a simple example:
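
A sketch (the path and exclude pattern are illustrative):

input {
  file {
    path => "/var/log/app/*.log"
    exclude => "*.gz"
    codec => "json"
  }
}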

This configures a file input to monitor an entire directory. It has been given an exclude parameter
to ensure gzipped files are not stashed after logrotate runs. It uses the JSON
codec for events, because this hypothetical application dumps JSON-encoded log-lines.
This puts the fields in the JSON into the event directly, without need for parsing
them out during the filter stage. Handy!

A warning about multi-line events

Some applications, such as tomcat, are known to produce logs with events spread across many,
many, many lines. Logstash needs to be explicitly told about the multiline format
for a given input plugin. To handle logs of this type, you will need to specify
codec => multiline {} in your file {} declaration.
The options in this codec specify how multiple lines are aggregated into a single event.
For Tomcat, the rule could be: lines that begin with whitespace are aggregated with the previous line.
When this codec is used on an input it runs in a single thread, allowing it to
successfully merge multi-line events.
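
For the Tomcat case, that could look like this sketch (the path is illustrative):

input {
  file {
    path => "/var/log/tomcat/catalina.out"
    codec => multiline {
      # Lines beginning with whitespace belong to the previous line.
      pattern => "^\s"
      what => "previous"
    }
  }
}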

The pattern and what settings declare that lines starting with whitespace
belong to the previous event in the pipeline.

Multi-line events are not readily assembled if they’re not parsed directly from the
file. The distributed nature of Logstash means that a single filter-thread is not
guaranteed to see all of the log-lines for a specific file.

Eventlog

The eventlog plugin is similarly simple. The one caveat here
is that Logstash has to run on the node you want to collect Event Logs from.
So if you want to collect from a Domain Controller, Logstash has to be on the
Domain Controller. Event Log forwarding may be of use here, but setting that up
is beyond the scope of this article.
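
A sketch (note that this plugin’s interval is given in milliseconds):

input {
  eventlog {
    logfile => "System"
    interval => 2000
    type => "eventlog-system"
  }
}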

This input will read the System event-log, polling every 2 seconds, and set the
event type to eventlog-system. Because this is polling the System log, the user
that Logstash runs under will need to be granted the right permissions to see
this Event Log.

Beats

The beats plugin opens up a TCP port to listen for traffic
generated by Elastic Co’s beats framework of data shippers written in Go.
If the idea of putting a JVM on everything fills you with ennui, Beats may be for
you. They are intended to run directly on your application hosts, and do
no filtering. They can deliver events to the beats plugin here, or ship through
intermediaries such as Redis or Kafka, or directly into ElasticSearch. From there, a full
Logstash instance running one of those Inputs can filter the events gathered by
Beats and output them to their final destination.

input {
  beats {
    port => 5044
  }
}

This tells the beats plugin to listen on the specified port, which is the default
port for beats. On the shipper side, you would tell Beats to send events through
their beats output to this server.

Outputs!

We’ve seen a few examples of what configuration blocks look like now, so the outputs
should be no surprise. I’ll go over three different outputs: ElasticSearch, Redis,
and PagerDuty.

ElasticSearch

Because Elastic Co. makes Logstash, ElasticSearch is the platform’s preferred output,
but you do not have to use it. If you do, the ElasticSearch output
is how you do so. This can be very simple:
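
A minimal sketch (the hosts value is illustrative):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}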

This outputs to an ElasticSearch instance on the local box, or to a well-known
address on the network somewhere. It writes to the logstash index, which is
the default name, and will create a new index every day (the %{+YYYY.MM.dd} part
is where the rotation is configured). It’s up to you to clean up the old indexes,
though Elastic provides Curator to help with that.

For those who are more comfortable with ElasticSearch, you can definitely get
more complicated with it:
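
A sketch along those lines (the template path and name are illustrative):

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "audit-%{+xxxx.ww}"
    # Upload a custom mapping for the audit indexes.
    template => "/etc/logstash/templates/audit.json"
    template_name => "audit"
  }
}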

This outputs to an index named audit, which is rotated weekly instead of daily
(xxxx.ww specifies the ISO year and week). What’s more, we are uploading a custom
Mapping to the audit indexes to ensure the fields that enter ElasticSearch are typed
and mapped correctly.

Redis

The redis plugin is often used in architectures where the Logstash
on the application nodes is configured to ship events without filtering, or with
Beats which can’t filter. Another set of Logstash nodes use the Redis input to
pull events off of the queue for filtering and outputting. This way, CPU and RAM
loading is isolated from application load.
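
The shipping side could look like this sketch:

output {
  redis {
    host => "logstash-redis.prod.internal"
    data_type => "list"
    key => "logstash-firehose"
  }
}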

This sends all events to the redis server at logstash-redis.prod.internal, and
will store all events in the key named logstash-firehose. The input plugin is
structured similarly to this one. Parsing nodes will pull events out of that key
to filter and output them.

PagerDuty

The PagerDuty plugin allows you to wake humans up directly from
logstash. It’s a pain, but it can be a very useful part of your monitoring
infrastructure, as Logstash is a form of push-monitoring and can react faster than
most poll-based monitoring methodologies. Not every organization will want to do this,
but it’s important enough I feel I should mention how it’s done.
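
One sketch of such a setup, assuming a hypothetical queue_depth field populated earlier in the pipeline and an illustrative service key:

output {
  if [queue_depth] and [queue_depth] > 100000 {
    pagerduty {
      service_key => "your-pagerduty-service-key"
      event_type => "trigger"
      incident_key => "logstash/redis/backlog"
      description => "Redis queue depth is %{queue_depth}"
    }
  } else if [queue_depth] {
    pagerduty {
      service_key => "your-pagerduty-service-key"
      event_type => "resolve"
      incident_key => "logstash/redis/backlog"
      description => "Redis queue depth is back to normal"
    }
  }
}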

This setup will automatically trigger, and automatically resolve, the case where
a redis queue gets excessively backed up. Remember that incident_key is used
by PagerDuty as the unique identifier, so structure that to ensure your triggers
and resolves are touching the right incidents.

Logstash can also output to email, but I’m leaving it up to you to figure that out.
Our inboxes are already too full.

Now that we’ve handled inputs and outputs, time to filter our events.

Transforms! I mean, Filters!

We’ve learned how to get the data, and dispose of it once we’re done with it. Now
for the hard part: transforming it. There are a wide variety of filter plugins
available, but I’m going to focus on a few of the most useful for sysadmins.

grok: Using Regex to populate additional fields.

kv: Using key=value pairs to populate additional fields.

json: Using JSON expressions to populate additional fields.

mutate: to manipulate fields on events, often combined with logstash conditionals.

date: turn the timestamps in your logs into timestamps on the events.

Grok

Grok uses regular expressions to parse raw log text into fields, letting you
extract meaning from it in ways you can later automate. Want to get an alarm
when the PII backup is anything but OK? With Grok, you set it up so you can get
that alarm. A full breakdown of how it works would be a SysAdvent post all by itself.
Instead, I’m going to cowardly forward you to an example I used at LISA this year.
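
As a taste, though, here is a sketch that parses a hypothetical backup-status line like "BACKUP pii-db STATUS=FAILED" into fields you can alert on:

filter {
  grok {
    # NOTSPACE and WORD are stock grok patterns.
    match => { "message" => "BACKUP %{NOTSPACE:backup_name} STATUS=%{WORD:backup_status}" }
  }
}

After this, a conditional such as if [backup_status] != "OK" can route the event to an alerting output.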

KV

This allows you to turn strings full of key=value pairs into fields on an event.
Quite powerful, if you have something that logs that way. Take the following log-line:
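
Here is a hypothetical one, with the pairs sitting in the message field:

  backup_output STATS objects=1311 size=5832999

And a filter sketch to match:

filter {
  kv {
    source => "message"
    prefix => "backup_"
  }
}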

This will take that logline and create backup_objects and backup_size fields,
thanks to the prefix declaration. You can use these numbers later on for outputting
to statsd, opentsdb, or anything else really.

JSON

Like the KV filter, the JSON filter populates fields on the event, in this case by parsing JSON. If that log line
had been formatted like this instead:
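
Say the same stats arrived as JSON in a hypothetical backup_message field:

  {"backup": {"objects": 1311, "size": 5832999}}

The filter sketch would be:

filter {
  json {
    source => "backup_message"
  }
}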

Because we’re importing a data-structure instead of a flat namespace, the created
fields will be backup.objects and backup.size, where objects and
size are sub-fields of the backup field.

Mutate

Mutate allows you to add and remove fields, add and remove tags, upcase or lowercase
field contents, join arrays into strings, split strings into arrays, perform regex
replace operations, and cast fields into specific data-types. This is mostly used
to normalize your events, and clean up fields used for processing that don’t need
to be in the final data-stores.

Let’s take the backup output we’ve been working with. We have some numbers in there,
so let’s force them to a numeric type.
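
A sketch, continuing with the fields from the JSON example:

filter {
  mutate {
    convert => {
      "[backup][objects]" => "integer"
      "[backup][size]" => "integer"
    }
    remove_field => [ "backup_message" ]
  }
}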

This will cast both of those fields to integer, and remove the backup_message
field we processed using the JSON filter.

Date

The date filter is what you use to turn the timestamp in your logline
into the timestamp of your event. By default, the timestamp on an event is when
it is ingested. This can lead to apparent event-spikes if your ingestion pauses
and has to catch up.
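
Pulling the filters from this article together, the numbered steps below describe a config along these lines (a sketch; paths, patterns, and program names are illustrative):

input {
  # 1. Read the local syslog file.
  file {
    path => "/var/log/syslog"
    type => "syslog"
  }
}
filter {
  # 2. Parse the syslog line into the usual syslog fields.
  grok {
    match => { "message" => "%{SYSLOGLINE}" }
  }
  # 3. Use the parsed timestamp as the event's timestamp.
  date {
    match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
  }
  # 4. Backup-specific parsing (the program name is illustrative).
  if [program] == "backup_output" {
    grok {
      match => { "message" => "%{WORD:action} %{GREEDYDATA:backup_message}" }
    }
  }
  # 5. Extract backup details from the STATS line.
  if [program] == "backup_output" and [action] == "STATS" {
    json { source => "backup_message" }
  }
  # 6. Cast the stat fields and drop the scratch field.
  mutate {
    convert => {
      "[backup][objects]" => "integer"
      "[backup][size]" => "integer"
    }
    remove_field => [ "backup_message" ]
  }
}
output {
  # 7. Into a daily-rotated index.
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}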

1. This reads the local syslog file (pre-systemd of course).
2. A Grok expression parses the syslog file and populates the usual syslog fields.
3. A date expression pulls the timestamp out of the syslog fields.
4. If the program is a backup program, a second Grok expression populates backup-specific fields.
5. If the event is backup_output and the action is STATS, parse the backup_message to extract details about the backup.
6. Then cast the datatypes for the stat fields to integer, and drop the backup_message field entirely.
7. Finally, output the whole lot to elasticsearch, into an index that rotates daily.

Architectures

At last, here are a couple of architectures for your Logstash setup. It can scale
to a huge footprint, if you do it right. This should give you the hints needed to see how your
own environment can grow.

In the first architecture, Logstash is running on every node and feeding directly into
the backing storage. The filter {} sections on each node will be customized for
that particular workflow. The advantage to this design is that there is minimal
latency between when an event enters the pipeline and when it is queryable in the backing
storage.

The disadvantage is that you’re doing potentially CPU- and RAM-intensive work on
a node that may already be busy doing production work.

In the second architecture, we’ve added a queueing layer. There is still a Logstash
running on the nodes, but all it does is ship events; the filter {} section will
be absent, or very minimal, on those. Events enter the queue largely unprocessed,
though with sourcing tags. A parsing tier of Logstash instances then services the
queue. This parsing tier has complex filter {} sections, as it has to deal with
multiple types of services.
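
On the application nodes, the entire shipping config can be as small as this sketch (paths and hostname are illustrative), with no filter {} at all:

input {
  file {
    path => "/var/log/app/*.log"
    tags => [ "app", "prod" ]
  }
}
output {
  redis {
    host => "logstash-redis.prod.internal"
    data_type => "list"
    key => "logstash-firehose"
  }
}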

The advantage to this architecture is that the Logstash footprint on the nodes
doing production work is very small, and it can be made smaller still if you leverage
Beats instead of Logstash (if that makes sense for you). The use of a queue allows
you to scale up your parsing tier as event loads change. That queue also gives you
a buffer for your events to handle spikes. There is a security boundary here as well:
since application nodes don’t have access to where the logs are stored, an attacker
can’t hide their tracks as easily as if they had write access to them, as in the first
architecture.

The disadvantage to this architecture is latency; even on the most optimal path, there
are a few more network round-trip-times for an event to pass through before it is
queryable. Also, the parsing tier runs at very high throughput, so mistakes in optimization
can lead to unexpected reductions in that throughput.

Scaling tips

As you grow your logstash infrastructure, here are some tips to keep things running
at full speed.