This blog post will talk about how to transport logs from multiple servers to a central location for storage or analysis. Apache Kafka is chosen for this purpose to allow for easy scalability as well as for it’s versatility and community.

Good and meaningful logs are the first step to be able to quickly identify and debug issues as well as predict future problems. As long as the whole application stack fits onto one or two servers, manual reading or parsing of the logs will suffice most of the time. The more servers are added to the stack and the more services are running on these servers, the more complex it becomes to analyze and correlate the data. As soon as there are two application servers that are behind a round robin load balancer, the logs on a single server may no longer be enough to track down how a user triggered a bug or why a certain request took a very long time.

The first part of the series “Scalable and Robust Logging for Web Applications” described how to improve the default Ruby on Rails logger with Log4r to get more structured logs and data that matters. This is the second post in a series which’s goal it is to develop a robust system for logging, monitoring and collection of metrics that can easily scale in terms of throughput – i.e. adding more application servers – but is also easy to expand to new types of log data – i.e. by adding a new database or importing external data sources – and makes it easy to modify and analyze the data exactly as you need.

(Subscribe via Email on top of the right side-column to get informed as soon as the next post in this series is published!)

What to Transport? Logs vs. Metrics

The first consideration should be if it is possible and/or necessary to transport all logs to a central location. If there are many servers or a lot of log data, this might be very resource intensive and aggregating or filtering the data might be necessary. The extremes of this are either transporting every single log vs. only transporting aggregated metrics. The following paragraphs try to help you decide on the right balance for your use case.

Advantages of transporting all logs:

Metrics can be added, modified and deleted in one central location

Historical data on new metrics can be computed from the stored logs

Possibility to peek into live data-stream

Allows building complex debugging and monitoring tools

Central location for all logs. Invaluable for debugging, root-cause analysis and correlation of incidents

Advantages of transporting only metrics:

Transporting (aggregated) metrics requires far less bandwidth

Smaller storage requirements

Scales far better

Better than nothing

Overall transporting all logs has many advantages and should be preferred over aggregated metrics if possible. Especially managing metric definitions in one place and the ability to compute historic data for new metrics is very valuable. Also does transporting all logs allow for thorough (computationally expensive) data analysis on historic data to i.e. train machine learning models, predict behavior or give enhanced insights into who your users are and what they do.

There is no strict rule to follow and it is perfectly ok to mix and match. An example would be to just transport logs that contain examinable data and aggregate performance metrics, like average response time or jobs processed per minute, on the server.

This post will focus on the transport of raw log data. The posts “Server Monitoring with Sensu” and “Metrics with Graphite” will introduce better suited technologies to work with pure metrics.

Apache Kafka

Introduction

Apache Kafka is a publish-subscribe messaging system that can handle high throughput and is easy to scale horizontally, which makes it a perfect for transporting huge amounts of log data. It is written in Scala and was initially developed at LinkedIn.

Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers

Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

Distributed by Design
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

– from the official Apache Kafka website

Following is an excerpt form the paper the team at LinkedIn published about Kafka. It describes the basic principles, however you are encouraged to read the complete paper to understand how Kafka works and why certain design decisions were made.

[Here is an introduction to the] basic concepts of Kafka. A stream of messages of a particular type is defined by a topic. A producer can publish messages to a topic. The published messages are then stored at a set of servers called brokers. A consumer can subscribe to one or more topics from the brokers, and consume the subscribed messages by pulling data from the brokers. (…) To subscribe to a topic, a consumer first creates one or more message streams for the topic. The messages published to that topic will be evenly distributed into these sub-streams. (…) Unlike traditional iterators, the message stream iterator never terminates. If there are currently no more messages to consume, the iterator blocks until new messages are published to the topic.– Kreps et al, “Kafka: a Distributed Messaging System for Log Processing”

Getting Started With Apache Kafka

This part will talk about how to set up a local Apache Kafka instance for testing purposes and how to configure a production ready three-node Kafka cluster using Zookeeper. This post will use the latest version of Apache Kafka – 0.8.1 – which is still considered beta but seems to be reasonably stable and is used by companies like LinkedIn in production.

Running Kafka Locally

Kafka comes with simple local scripts to start preconfigured Zookeeper and Kafka instances as well as simple producers and consumers. This allows to get started with development in less than a minute. Similar instructions for the latest stable release – 0.7.2 – can be found on the Apache Kafka 0.7 quickstart page.

The following excerpt from the Apache Kafka 0.8 documentation lists the commands necessary to get a single local Kafka server up and running. The guide describes the steps in more detail should something be unclear.

This setup allows to test and code against a Kafka 0.8 instance on your local machine. The next part describes how a production cluster can be set up.

Setting Up a Production Ready Kafka Cluster With Zookeeper

To run a cluster with reasonable reliability, three machines with a replication factor of two should be the minimum in a production environment. Three machines allow one to go down or replacing as well as updating a single node in the cluster without creating a single point of failure.

A replication factor of two means that the data is at any time at two separate servers – with some synchronization delay. Should one server go down, all the data that was already synchronized to another node will be available there and Kafka will make this node the new “leader”, the primary node to write to and read from. As soon as the node fails, another node will become the “follower” (on a topic by topic basis) to satisfy the minimum replication factor.

Because replication in case of a failover also puts stress onto the servers, it is a bad idea to only have as many servers as are immediately necessary to handle the load. Always make sure the cluster can handle the peak load even if one broker goes down to not run into problems when an instance actually breaks or needs replacement. If this is not the case, increase the number of Kafka servers. A replication factor of two can be maintained with a larger amount of nodes unless greater reliability is desired.

Zookeeper is a little different as it needs a majority of nodes to operate. This means in a three node Zookeeper cluster, only one node can fail. Because in this example three nodes are co-located with the Kafka servers, it is a good idea to have at least one more Zookeeper instance outside of the Kafka servers to ensure that consumers still can process messages if two of the Kafka servers go down. This is a must if any other services in your stack use Zookeeper as well. (For more information about using Zookeeper, read Scott Leberknight’s excellent introductory series on using Zookeeper)

Configuring Zookeeper

It is a good idea to use tools like Chef or Puppet to configure and set up services like Zookeeper and Kafka. They not only allow to maintain all configuration files in a single place and adapt them based on the server they are deployed to, they also allow to replicate setting up a new node easily. This means instead of installing and configuring a new instance, only a single command is necessary to provision a new Kafka server and integrate it into the cluster.

This part will talk about the most important parts of configuring Zookeeper, however to get the best performance you have to read the docs yourself and tune your configuration to match your use case. Most importantly, don’t forget to tune your JVM settings as well to make sure there are no memory issues slowing down your application. Monitoring Zookeeper and Apache Kafka can be done using tools like JConsole or jmxtrans.

Each Zookeeper server has to have a unique ID between 1 and 255. It has to be stored in the myid file, containing only the id of the current Zookeeper instance. The zoo.cfg configuration file contains a list of all the zookeeper instances that are part of the cluster. Each instance should know of all the other instances in the cluster:

Configuring Apache Kafka

The Kafka brokers need more configuration to make sure they are working as efficient as possible on your machine, especially considering how many disks, cores and memory your servers have. The most important configuration options for replication are following:

# brokers before shutting itself down. This reduces the unavailability

# window during shutdown.

controlled.shutdown.enable=true

Take a look a the documentation and tweak the other parameters, most importantly concurrency and memory, before starting up the cluster.

Handling Logs

After the Kafka cluster is set up, the next step is to take a look at the different logs and the storage methods used.

Log to disk

Each program should log to the disk, even if logging directly to Kafka is available in the logger. One reason is that if something goes wrong during the transport, the full logs are still on the server and can be reviewed manually. The other concern is that Kafka uses TCP and therefore the log transport might slow down or block the program itself, which has to be avoided at any cost.

Another benefit of this approach is that it keeps the log pipeline the same for each program.

Logrotate

When saving logs to disk it is important to rotate the logs so the files don’t become too large, have a good structure to allow searching the logs manually and, most importantly, allow easy (automated) deletion of old logs so these don’t fill up the hard drive.

Even though many logging frameworks support log rotation on their own, using the logrotate command seems slightly better (imho) because all settings can be viewed in one file or script and changes to the log rotation don’t affect the application.

Transporting Logs

tail2kafka

There is an excellent program for Kafka 0.7 brokers that tails your log files and sends the data to Kafka, called tail2kafka. However it does not support the Kafka 0.8 protocol and therefore would not allow to use features like topic partitioning / load balancing through the broker.

There is still an issue with the shutdown, but it should not affect the running operation of the tool. I am no python wizard, so if you know how to fix it, please let me know or submit a pull request.

I hope this article gave you a good introduction into how to use Kafka to transport your logs off the server for processing. The next posts in this series will take a look at how to process the data once it’s sent to Kafka.

What do you think? Do you have any comments, improvements, questions or just enjoyed the read? Go write a comment below, at reddit or HackerNews and I try to respond as quickly as possible :)

Help others discover this article as well: Share it with your connections and upvote it on reddit and HackerNews!

Subscribe to the blog to get informed immediately when the next post in this series is published!

8 thoughts on “Log Transport and Distribution with Apache Kafka 0.8: Part II of “Scalable and Robust Logging for Web Applications””

Thank you for the informative write up. I landed on this page while search for ways to merge multiple apache httpd logs(load balanced). If I understand correctly, I can use tail2kafka to consume logs from all the various httpd servers and then merge them on the kafka cluster. Does this sound right?

Do you mind if I quote a few of your posts as long as Iprovide credit and sources back to your website?

My bloog site is in the exact same area of interest as yours and my visitors
would truly benefit from a lot of the information you present here.
Please let me know if this ookay with you.
Appreciate it!

tail -n0 will start listening to all changes from this point on. -n tells tail how many lines from the bottom of the file it should print. 0 in conjunction with -f means follow changes but don’t print anything that happened before I started tail.

Hi, Can i know what is the benchmark for this solution. Just trying to find if there are other options to filebeats/rsyslog, etc. Stumbled upon your blog. Our Pipeline is kafka as well. So would like to know if using this make things easier.
How many files can this read at once ??

Meet me at…

May2018

Categories

About the Author

Markus Lachinger is an entrepreneur, product manager and full-stack software engineer with a Masters in Software Management from Carnegie Mellon University. He is based in Mountain View, CA and blogs mainly about Ruby, Scala and Scalable Infrastructure. Find out more