Thibaut Géry

Centralize logs from Docker applications

I would like to thank my coworkers at Octo for their precious
feedback.

This article aims to show how we can centralize logs from a Docker
application in a database where we can then query them.

This article is built around an example where our application consists of an
nginx instance, an Elasticsearch database, and Kibana to render beautiful graphs
and diagrams. The code for the example is available on GitHub.

We need to collect and transport our logs as a data flow from a distributed system to a
centralized remote location. That way, we can get an aggregate vision of the system
in near real time.

The logging system is plugged at the container level because the application
should be loosely coupled with the logging system. Depending on the environment
(development, pre-production, production) we might not send logs to the same
system: a file for development, Elasticsearch for pre-production, Elasticsearch
and HDFS for production.

Architecture

Choosing our middleware

We need a tool to extract the logs from the Docker container and push them into Elasticsearch. To do that, we
can choose between several tools, such as Logstash or Fluentd.

Logstash is built by
Elastic and is well integrated
with Elasticsearch and Kibana, and it has lots of plugins.
Fluentd describes itself as an open source data
collector for a unified logging layer. Docker provides a driver to push logs
directly into Fluentd, and Fluentd also has many plugins, including one to connect to
Elasticsearch.
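For instance, the driver can be enabled per container like this (a sketch; the address and tag values are assumptions, and tag templates such as {{.Name}} require Docker 1.9+):

    # Send this container's stdout/stderr to a local Fluentd instance
    # instead of the default json-file driver.
    docker run --log-driver=fluentd \
      --log-opt fluentd-address=localhost:24224 \
      --log-opt fluentd-tag=docker.{{.Name}} \
      nginx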

Infrastructure

We can use two types of infrastructure: either a classic architecture with
servers or a cloud-based one. I’ve chosen the classic one for simplicity’s sake.

Therefore two servers are needed, one for our application and Fluentd, and one for our
Elasticsearch database and Kibana.

Process

The process can be described in six steps:

1. Users connect to our application (nginx), which generates logs

2. Our containerized application sends its logs to stdout and stderr

3. Docker intercepts the logs from the container and uses its native Fluentd
output driver to send them to the Fluentd container running locally

4. Fluentd parses and structures the logs

5. The structured data is sent to Elasticsearch in batches; we might have to
wait a minute or two for the data to arrive in Elasticsearch, and we can tune
this behavior with the Buffer plugins

6. The data is exposed to administrators through graphs and diagrams with
Kibana

Application

As the application is a simple nginx instance, I've packaged a new image, since
the official one uses a custom logger that is not appropriate for our purpose.
We can run the app with docker-compose up using the following configuration.
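Here is a minimal docker-compose.yml sketch of that configuration (compose v1 syntax from the Docker 1.9 era; the image names and the 24224 forward port are assumptions, not the exact files from the repository):

    # Our repackaged nginx, logging through Docker's Fluentd driver.
    nginx:
      image: thibautgery/nginx        # hypothetical name for the repackaged image
      ports:
        - "80:80"
      log_driver: fluentd
      log_opt:
        fluentd-address: "localhost:24224"
        fluentd-tag: "nginx.docker.{{.Name}}"

    # The local Fluentd collector.
    fluentd:
      image: fluent/fluentd
      ports:
        - "24224:24224"               # forward input used by the Docker driver
        - "8888:8888"                 # HTTP input used in the example below
      volumes:
        - ./fluent.conf:/fluentd/etc/fluent.conf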

The Fluentd configuration consists of two steps, sketched below. The first step
defines how to capture the data, here on port 8888 using HTTP. The second step
defines how to output the data, in this case by printing it to stdout.
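A minimal fluent.conf sketch of those two steps (patterned after Fluentd's own quickstart; the port matches the compose sketch above):

    # Step 1: capture data over HTTP on port 8888.
    # The tag of each incoming event is taken from the URL path.
    <source>
      @type http
      port 8888
    </source>

    # Step 2: output the data by printing it to stdout.
    <match myapp.access>
      @type stdout
    </match>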

The data is streamed through Fluentd. Each chunk of data is tagged, and the tag
is used to route the data between the different steps.

In the previous example, the tag is specified after the match keyword: myapp.access.
The tag of the incoming data is taken from the URL path of the request.
For example, running curl 'http://localhost:8888/myapp.access?json={"event":"data"}'
outputs {"event":"data"} to stdout.

The Docker driver uses a default tag for Fluentd: docker.<container-id>.
We override it to nginx.docker.<container-name> with the log_opt entry
fluentd-tag: "nginx.docker.{{.Name}}", as in the compose sketch above. The tag
set by the Docker driver must match the pattern in the Fluentd configuration.
We should now be able to see the nginx logs in the Fluentd container's log.

Right now, our system is useless: we need to send the logs to a remote database,
Elasticsearch.
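A sketch of the corresponding Fluentd configuration, using the fluent-plugin-elasticsearch output (the Elasticsearch host name and flush interval are assumptions; Elasticsearch lives on our second server):

    # Receive the logs forwarded by Docker's Fluentd driver.
    <source>
      @type forward
      port 24224
    </source>

    # Ship everything tagged by the Docker driver to Elasticsearch,
    # using the Logstash index naming convention (logstash-YYYY.MM.DD).
    <match nginx.docker.*>
      @type elasticsearch
      host elasticsearch.example.internal   # assumed address of the second server
      port 9200
      logstash_format true
      flush_interval 10s                    # assumed; see the Buffer plugins
    </match>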

Then we can run the application and query it with our favorite browser to fetch
some lines from Elasticsearch in the Logstash index. Since Fluentd buffers the data
before sending it in batches, we might have to wait a minute or two.
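For example (the host name is an assumption; 9200 is Elasticsearch's default port):

    # Fetch a few documents from the Logstash-formatted index.
    curl 'http://elasticsearch.example.internal:9200/logstash-*/_search?q=*&size=5&pretty'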

Unfortunately, only the Docker metadata (like the container name, labels, id…)
is structured: the log field contains the raw nginx log line. For
example, we cannot query all failed HTTP requests (status code >= 400).
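At this stage an indexed document might look like this (an illustrative sketch, not actual captured output):

    {
      "container_id": "b2c0d3f8a1...",
      "container_name": "/nginx",
      "source": "stdout",
      "log": "172.17.0.1 - - [24/Mar/2016:10:15:02 +0000] \"GET /missing HTTP/1.1\" 404 142 \"-\" \"curl/7.35.0\""
    }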

This log line needs to be parsed.

Structure the application logs

Fluentd needs the fluent-plugin-parser
in order to parse a specific field a second time. I have packaged an image with the plugin here.
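Here is a sketch of the parsing step (format nginx is Fluentd's built-in access-log parser; key_name log targets the field written by the Docker driver):

    # Parse the raw nginx line stored in the "log" field a second time.
    <match nginx.docker.*>
      @type parser
      format nginx        # Fluentd's built-in nginx access log format
      key_name log        # the field holding the raw log line
      reserve_data yes    # keep the existing fields (the Docker metadata)
      remove_prefix nginx # re-emit the event tagged docker.<container-name>
    </match>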

This configuration keeps the previous information in the message (reserve_data)
and re-emits it as docker.** (remove_prefix), so the Elasticsearch match from
the previous section must now match docker.** instead of nginx.docker.*.

Here we use the tag concept to route the data through the correct steps:

1. The data arrives with the tag set by the Docker driver:
nginx.docker.<container-name>

2. Fluentd sends it to the second step

3. The tag is modified to docker.<container-name>

4. The data goes through the third step

5. The data is sent to Elasticsearch

Here is the structured data we can now use to create diagrams:
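(An illustrative sketch; the field names come from Fluentd's nginx format, and the values are made up.)

    {
      "container_name": "/nginx",
      "source": "stdout",
      "remote": "172.17.0.1",
      "method": "GET",
      "path": "/missing",
      "code": "404",
      "size": "142",
      "agent": "curl/7.35.0"
    }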

We can then run the application and query it with our favorite browser to see
the data correctly formatted in Kibana.

Run it

Our system now collects logs from our application and sends them to
Elasticsearch. Note that the Docker engine requires Fluentd to be up and running
in order to start our containers.

Once started, however, our containers keep working properly even if Fluentd
dies. Furthermore, if Fluentd stops for a short period of time, we do not lose
any logs, because the Docker engine buffers unsent messages and sends them once
Fluentd is back online.

Finally, since Docker 1.9, we can include labels and environment variables in
the messages produced by the Docker logging driver. In our example, we added the
label service: nginx, and it shows up in Kibana.
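In the compose file this looks roughly like the following (a sketch; the labels log option tells the driver which container labels to forward):

    nginx:
      image: thibautgery/nginx    # hypothetical repackaged image, as above
      labels:
        service: nginx            # the label we want to see in Kibana
      log_driver: fluentd
      log_opt:
        fluentd-tag: "nginx.docker.{{.Name}}"
        labels: service           # forward this label with every log message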

We are now able to create graphs such as this one:

Conclusion

So far, we have seen how to collect and structure logs from Docker and push them into
Elasticsearch. We can easily swap the Elasticsearch plugin for the Mongo or
HDFS plugin and push the logs to the database of our choice.
We can also add an alerting system such as Zabbix.
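For example, with fluent-plugin-mongo installed, redirecting the logs is just a matter of replacing the output block (a sketch; the host and database names are assumptions):

    # Send the structured events to MongoDB instead of Elasticsearch.
    <match docker.**>
      @type mongo
      host mongo.example.internal
      database logs
      collection nginx_access
    </match>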

We can add nodes to our infrastructure and run several containers on one node.
Keep in mind that this article doesn't cover everything. For instance, we have not answered the following questions: