Cloud debugging with distributed tracing

Modern clouds are large, complex, distributed systems. They are usually implemented with a variety of software components that run independently on various remote nodes that communicate with each other over the network.

Debugging and monitoring local systems is a well-documented process, thanks to a plethora of tools and APIs. Network debugging and monitoring at the cloud level, however, is another beast entirely. Due to the number of software components at stake and the heterogeneous nature of the underlying network protocols and remote procedure calls, getting a clear, timely, ordered, and structured view of a cluster infrastructure's status and logs proves challenging. Finding and implementing cloud-specific tools and architectures that can overcome that complexity is of utmost importance for fixing performance and scalability issues, or for debugging system crashes or regressions in such environments.

Here we will present and describe a few of these tools and explain how we're planning to base our ciao distributed tracing infrastructure on some of them.

What's so hard about distributed tracing?

Rich and exhaustive system logging is an operating system commodity feature. So what is preventing us from applying the same techniques to distributed architectures?

Let's first look at what we want to achieve with distributed tracing and logging:

Ease of access: We want to be able to easily access -- from a single interface -- all cluster traces and logs originating from all nodes. This typically means that each node on the system will periodically push logging data to a collecting entity.

Reliability: We want the cluster logs and traces to represent what is really happening on the system. Note that reliability does not necessarily imply completeness; that is, it may be possible to get a reliable view of the system without getting each and every trace from each node.

Scalability: A distributed tracing architecture should scale, and its ability to scale should never be the limiting factor for cluster deployments.

Low overhead: Tracing and logging should use minimal CPU power, memory, and network footprint.

Tasks such as these should have a negligible performance impact on applications running on the cluster, should not need to be turned off on highly optimized deployments, and should be able to run permanently.

Application transparency: Enabling a distributed tracing infrastructure should not mean having to explicitly instrument all applications. However, the system should offer an annotation API for application implementations to leverage the tracing features at will.

Security and privacy: Traces should not include RPC payloads, as they can contain personal and/or confidential information. Ideally, traces should be collected through an encrypted medium.
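To make the annotation requirement above more concrete, here is a minimal, language-agnostic sketch in Python of what such an annotation API could look like. All names here are illustrative assumptions, not ciao's actual API; the one property taken from the requirements is that only metadata and free-form notes are recorded, never RPC payloads.

```python
import time


class Tracer:
    """Hypothetical annotation API sketch (illustrative, not ciao's API)."""

    def __init__(self):
        self.records = []

    def annotate(self, component, message):
        # Record a timestamped note about a component. RPC payloads are
        # deliberately never captured, per the privacy requirement above.
        self.records.append({
            "timestamp": time.time(),
            "component": component,
            "message": message,
        })


tracer = Tracer()
tracer.annotate("scheduler", "instance placement decided")
```

An application would sprinkle such annotate() calls only where it wants extra visibility; everything else is covered by the transparent, infrastructure-level instrumentation.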

Now let's think about what it would take to build an architecture around these requirements for... say, a 1000-node cluster running 100 containers and 50 virtual machines each. We're looking at 1000 tracing agents potentially reporting dozens of gigabytes of traces per hour. This magnitude of data can be handled through these potential solutions:

Adaptive sampling: On one hand, trace data volumes can be extremely high. On the other hand, traces typically follow periodic patterns. We can therefore collect and store trace samples and still get an accurate cluster view. Ideally, the trace sampling should be adaptive, depending on the available cluster bandwidth and the tracing performance overhead. At a fixed and low sampling rate, rare but important events may be lost.
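As an illustration, the Python sketch below shows one possible adaptive sampling policy: the sampling rate is halved whenever the last interval's trace volume exceeds a budget, and slowly recovers otherwise. The budget, halving factor, and recovery factor are assumptions made up for the example, not any project's actual algorithm.

```python
import random


class AdaptiveSampler:
    """Illustrative adaptive sampler; parameters are example assumptions."""

    def __init__(self, budget_per_interval, initial_rate=1.0):
        self.budget = budget_per_interval  # traces we can afford per interval
        self.rate = initial_rate           # current sampling probability
        self.seen = 0                      # events observed this interval

    def sample(self):
        # Decide whether to keep this trace at the current rate.
        self.seen += 1
        return random.random() < self.rate

    def end_interval(self):
        # Over budget: halve the rate (floored). Under budget: recover slowly.
        if self.seen * self.rate > self.budget:
            self.rate = max(0.01, self.rate / 2)
        else:
            self.rate = min(1.0, self.rate * 1.1)
        self.seen = 0


sampler = AdaptiveSampler(budget_per_interval=100)
for _ in range(1000):
    sampler.sample()
sampler.end_interval()  # 1000 events against a budget of 100: rate drops
```

A real implementation would also factor in the measured collection bandwidth and CPU overhead, but the feedback-loop shape stays the same.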

Rate limiting: Although a distributed tracing architecture will provide annotation APIs for applications to use, it should also rate limit their use to prevent overzealous applications from cluttering the tracing pipeline.
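A classic way to enforce such a limit is a token bucket in front of the annotation API. The sketch below is illustrative; the rate and burst values are arbitrary assumptions, and excess annotations are simply dropped rather than queued.

```python
class TokenBucket:
    """Illustrative token-bucket limiter for an annotation API."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec       # steady-state annotations per second
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)     # start with a full bucket
        self.last = 0.0                # timestamp of the last call

    def allow(self, now):
        # Refill proportionally to elapsed time, then spend one token
        # per annotation; when the bucket is empty, the annotation is dropped.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=10, burst=5)
# An overzealous application fires 20 annotations at once; only the
# burst allowance gets through.
accepted = sum(bucket.allow(now=0.0) for _ in range(20))
```

Dropping rather than queuing keeps a misbehaving application from adding memory pressure to the tracing pipeline.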

Out-of-band trace collection: In order to have minimal overhead, a distributed tracing design will ideally collect traces "out of band" relative to the events it traces. The design should use dedicated, QoS-controlled networking bandwidth, and it should not piggyback traces on the cluster control plane payloads themselves: the weight that in-band traces would add to the control plane is obvious.

Set of collectors: Trace collectors should follow a one-way pipeline, reading logs from the cluster nodes and writing them into a storage back-end. Because such collectors are stateless, the set of collectors can be expanded and scaled according to the needs of the tracing workload.
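The scaling property follows directly from the collectors being stateless and one-way. The Python sketch below models it with threads draining a shared queue into a storage stand-in; all names are illustrative, and a real deployment would use separate processes, real transport, and a real back-end.

```python
import queue
import threading


def collector(node_queue, backend, stop):
    # One stateless collector: read traces from the nodes' queue and
    # write them to the storage back-end. Because collectors share no
    # state, adding more of them is all that scaling requires.
    while not stop.is_set() or not node_queue.empty():
        try:
            trace = node_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        backend.append(trace)  # stand-in for a real storage write


node_queue = queue.Queue()
backend = []
stop = threading.Event()
# "Scaling out" the set of collectors is just starting more of them.
workers = [threading.Thread(target=collector, args=(node_queue, backend, stop))
           for _ in range(4)]
for w in workers:
    w.start()
for i in range(100):
    node_queue.put({"node": i % 10, "event": "heartbeat"})
stop.set()
for w in workers:
    w.join()
```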

Contextual trace relationships: Storing traces is one part of the problem. Being able to make sense out of them is the other. Most of the time, traces have causal relationships; that is, a trace gets generated as part of an event that was caused by another traced event. Here we have at least two traces that relate to each other. Any distributed tracing architecture will provide a way for applications to build these relationships by propagating trace contexts. Some of them will also infer the causal dependencies of the traces.
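Context propagation usually boils down to carrying a trace ID, a span ID, and a parent span ID alongside each RPC. The Python sketch below illustrates the idea; the field and header names are made up for the example and do not correspond to any real wire format.

```python
import uuid


def new_context(parent=None):
    # A child context keeps the parent's trace ID and records the
    # parent's span ID, so causal links can be rebuilt after the fact.
    return {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
    }


def inject(context, rpc_headers):
    # Propagate the context alongside the RPC so the callee's traces
    # can be causally linked back to the caller's.
    rpc_headers["x-trace-context"] = context
    return rpc_headers


caller = new_context()                                  # root of a new trace
headers = inject(caller, {})                            # sent with the RPC
callee = new_context(parent=headers["x-trace-context"]) # rebuilt on the callee
```

With these three fields stored per trace, a back-end can reconstruct the whole causal tree of one request by grouping on trace_id and linking parent_id to span_id.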

Existing tracing tools

Several existing projects implement a distributed tracing architecture; these share many of the above-mentioned requirements and concepts.

Internally, Google uses Dapper*, which is based, at least partially, on a few older projects like Magpie* and X-Trace*. Although Dapper is not open sourced, Google published a foundational paper about it, and there are a few open source projects that are actual Dapper implementations:

Zipkin* is Twitter's open source implementation of Dapper. It is a Java and JavaScript* implementation deployed in Twitter's production environment. Zipkin's front-end instruments the Finagle RPCs and, as such, is tightly coupled with Twitter's cloud architecture.

Brave* is a Java re-implementation of Zipkin's front-end. One can combine Zipkin's back-end services with Brave in order to have a more Twitter-independent distributed tracing solution.

AppDash* is a Go distributed tracing implementation, based on Dapper and Zipkin. Another interesting AppDash feature is its OpenTracing support. OpenTracing is an attempt at standardizing distributed tracing APIs; specifically, the application instrumentation ones.

X-Trace is an open source, Java-based distributed tracing project led and funded by Brown University. It is the main design Dapper is based on, and as such most of the above-mentioned projects are X-Trace based as well. One very interesting X-Trace derivative is Pivot Tracing, which takes distributed tracing to the next level: it aims at letting users define their own monitoring primitives that can be installed at run time, without having to re-deploy the system.

Ciao tracing solution

Ciao implements its own distributed tracing architecture, which aims at monitoring and logging ciao clusters. It is not a generic tracing solution, and it is not meant to be used by applications or virtual machines running on top of ciao's orchestration. Instead, it instruments the SSNTP protocol and all ciao components and libraries to get an accurate and complete picture of the cluster status.

Ciao tracing is somewhat inspired by Dapper and shares many design concepts with existing open source tracing solutions. However, following ciao's tight integration strategy, it is highly integrated with and customized for ciao's needs. Except for adaptive sampling, it follows all of the above-mentioned requirements and solutions.