CloudFoundry’s Loggregator Server

The goal of Loggregator is to allow application developers to tail the logs of their applications while these are running on CF. The central component is the Loggregator server, which routes incoming messages. One of the key requirements for this server is that all developers get their logs fairly and that a malicious developer cannot cause message loss for other developers by writing very fast loggers or really slow log consumers.

The following drawing shows the basic mechanism of message distribution (every sprocket is a goroutine). Messages come into the system on the left and are processed by the main processing loop, which determines whether a message's id matches a particular consumer and the message should thus be forwarded to it. Every consumer forwarder has an internal incoming queue, from which it takes messages to forward to the external consumer.

Congestion in a naive implementation

If a consumer, say consumer 1, slows down, its incoming channel is going to fill up over time. Once that channel is full, sending to it blocks the main message processing loop, and every other consumer stops receiving messages as well. A buffered channel only postpones the problem: it blocks in the same way as soon as the buffer is full.

A channel-based ring buffer solution

Channels and goroutines to the rescue!

The idea is simple: connect two buffered channels through one goroutine that forwards messages from the incoming channel to the outgoing channel. Whenever a new message cannot be placed on the outgoing channel, take one message out of the outgoing channel (that is, the oldest message in the buffer), drop it, and place the new message in the newly freed slot of the outgoing channel.

Plugging in this “channel struct” will never block and will simply behave like a ring buffer. That is, slower consumers might lose (their oldest) messages, but will never be able to block the main message processing loop.

Other solutions

A few packages are available that implement ring buffers in a more classic way by using slices and moving pointers: e.g., container/ring and gringo.

The problem with these implementations is that they need locking to be used concurrently. In the case of container/ring proper locking needs to be ensured by the user of the package. In the case of gringo you will see extensive locking throughout the package when looking at the source code.

3 Comments

Jesse Zhang says:

Nice post on the ring buffer. I had a similar (and slightly simpler) solution in mind, which is just dropping the new incoming messages whenever there is a slow consumer.
It does not have precisely the ring buffer semantics, but on a higher level it solves the same problem: not blocking the producer (main loop). It also avoids the (theoretical) problem of doing a potentially blocking read on the output channel: there is a small window between:
– when we detected that the output channel was blocked; and
– when we attempt to unblock it.
Depending on the implementation of the goroutine scheduler, the output channel could have become empty before we read it, therefore blocking the middle “unclogger” goroutine.

Jesse

November 24, 2013 at 10:38 pm

Lukas says:

AFAIKS there is a race condition in the example that will cause a deadlock. When sending to outputChannel is not possible, the default case tries to drain one value from it by receiving from it (“<-r.outputChannel”).
Notice however that another goroutine may receive the blocking value from outputChannel before the default case gets executed. The RingBuffer{} now waits forever on the “<-r.outputChannel” line.

One solution might be to wrap the receiving operation in another select statement with an empty default case (so it always proceeds).

June 30, 2014 at 2:46 am

Stephan Hagemann says:

Yes, I believe, as Jesse was noting, this could happen if the select picks the unclogger branch and, before the unclogger gets to take the value off the channel, the channel gets emptied externally. I don’t think we ever observed this in production. If I remember correctly, the length of the buffer was 20 by default. As such, the buffer would have to be full at 20 items, the select would have to pick the unclogger branch, and the entire channel would have to be drained before the unclogger does anything. At that point, the routine would be locked.