Distributed tracing

Plumbr is designed to monitor the end user experience. Our approach is seemingly simple – Plumbr instruments the application bytecode during deployment and adds tracing code to the endpoints published by the application. When calls arrive at such endpoints, Plumbr is able to gather and analyze data from them.

This concept started to break down as early as 2014, when the uptake of both the cloud and microservice-based architectures began to gather pace. As a result, existing monoliths were being replaced with microservices deployed and dynamically scaled in the cloud.

The deployment of distributed and dynamically provisioned architectures has created a situation where tracing the end user experience in an individual node no longer exposes the entire user experience. Each node in the architecture can:

be responsible for servicing only a specific span in the context of the end user interaction;

have multiple (dynamically spawned) copies of itself acting as a cluster.

So it was only natural that we followed the path the industry was taking and added support for such deployment models. As a result, we ended up building something called distributed tracing.

This post is the first in a series describing how we built support for distributed tracing and which obstacles we needed to tackle. In this post we cover the concept of distributed tracing in general and demonstrate the two key pillars our solution was built upon. To some readers the concepts might be familiar – the key concepts applied were largely inspired by the research done while building Google's Dapper, so if you have investigated that project, you might recognize familiar aspects.

What is distributed tracing?

Distributed tracing is the concept of tracking a single user interaction with the application throughout an architecture deployed on multiple nodes. Each node captures the part of the interaction it serviced, and these individual elements are then used to build a view of the entire chain of calls behind the user interaction. This perspective allows us to see how different nodes interact with each other and to link the root causes of poor performance to a single user interaction.

The next sections describe our approach to building the solution for distributed tracing.

UUID

The first problem we needed to solve was understanding whether or not events in particular nodes are linked to the same user interaction. The answer was to generate a universally unique identifier (UUID) in the first node accepting the interaction and pass it along to the other nodes as call metadata.

The UUIDs (d19931bb-f235-4dcb-2e2f-b9d31225d62e as an example) are attached to the data each node sends to the Plumbr Server. Having this information allows us to assemble all the individual spans that share a UUID, resulting in a distributed trace.
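To make the assembly step concrete, here is a minimal Python sketch of generating an identifier and grouping reported spans by it. It is an illustration only, not Plumbr's implementation: the span fields (`trace_id`, `node`, `start_ms`) and the function names are assumptions made for this example.

```python
import uuid
from collections import defaultdict


def new_trace_id() -> str:
    """Generate a random (version 4) UUID for a new user interaction."""
    return str(uuid.uuid4())


def assemble_traces(spans):
    """Group span records reported by different nodes by their trace ID.

    The span layout used here is hypothetical, chosen only for illustration.
    """
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    # Order the spans of each trace by start time so the call chain reads in order.
    for trace_spans in traces.values():
        trace_spans.sort(key=lambda s: s["start_ms"])
    return dict(traces)


# Two interactions; their spans arrive at the server interleaved and out of order.
trace_a = new_trace_id()
trace_b = new_trace_id()
reported = [
    {"trace_id": trace_a, "node": "inventory", "start_ms": 40},
    {"trace_id": trace_b, "node": "frontend", "start_ms": 5},
    {"trace_id": trace_a, "node": "frontend", "start_ms": 0},
]
traces = assemble_traces(reported)
```

Because the identifier is the only join key, the server needs no knowledge of the topology in advance: any node that reports a span with a known UUID becomes part of that trace.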

HTTP headers

The generated UUID needs to be passed along with each call to a remote node. As we cannot change the contract between the nodes participating in a transaction, we had to find other means of injecting the metadata into the call. The solution ended up being protocol-specific, with the first implementation relying on HTTP.

The method we used for passing along the UUID involved injecting our own custom HTTP header into the downstream calls. So every HTTP request departing a node monitored by Plumbr would, besides the existing headers (Accept, Accept-Encoding, etc.), include our custom header carrying the UUID.
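As a rough illustration of such an injection, the Python sketch below adds a trace header to an outgoing request object without touching its URL, body or existing headers. The header name `X-Trace-Id`, the target URL and the helper name are placeholders invented for this example; Plumbr does this at the bytecode level in the JVM, so this only shows the idea.

```python
import urllib.request

# Hypothetical header name, used only for this example.
TRACE_HEADER = "X-Trace-Id"


def with_trace_header(request, trace_id):
    """Attach the trace ID to an outgoing request, leaving everything else as is."""
    request.add_header(TRACE_HEADER, trace_id)
    return request


# Build (but do not send) a downstream request that already has its own headers.
req = urllib.request.Request(
    "http://inventory.example/items",
    headers={"Accept": "application/json"},
)
with_trace_header(req, "d19931bb-f235-4dcb-2e2f-b9d31225d62e")

# urllib normalizes the capitalization of header names; HTTP header names are
# case-insensitive, so compare them in lower case.
sent_headers = {name.lower(): value for name, value in req.header_items()}
```

The call contract between the two services is untouched: a receiver that knows nothing about the extra header simply ignores it.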

If this header is not present in an incoming request, we are dealing with a new interaction, and we need to generate a new UUID to be used in this node and passed on to downstream calls.

If the header is present in the request, we are dealing with an interaction that arrived in our system via some other node. In this case we should not generate a new UUID. Instead, we should join the ongoing interaction and make sure the downstream calls from this node also carry the same UUID in the header.
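The two cases above boil down to a small decision on every incoming request. Here is a minimal Python sketch of it, assuming a hypothetical header name and a plain dictionary of request headers; the function name and return shape are made up for illustration.

```python
import uuid

# Hypothetical header name, kept in lower case for case-insensitive lookup.
TRACE_HEADER = "x-trace-id"


def resolve_trace_id(incoming_headers):
    """Return (trace_id, started_here) for an incoming request.

    Joins the ongoing interaction if the header is present; otherwise
    starts a new interaction with a freshly generated UUID.
    """
    normalized = {name.lower(): value for name, value in incoming_headers.items()}
    existing = normalized.get(TRACE_HEADER)
    if existing:
        return existing, False
    return str(uuid.uuid4()), True


# First node: no header yet, so a new interaction starts here.
first_id, first_started = resolve_trace_id({"Accept": "*/*"})

# Downstream node: the header is present, so the same ID is reused.
joined_id, joined_started = resolve_trace_id({"X-Trace-Id": first_id})
```

Whichever branch is taken, the resulting ID is the one to attach both to the data reported for this node and to every downstream call it makes.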

Take-away

I hope the post gave you some insight into how tracing events in distributed systems can be built. If the picture looks simple and straightforward, rest assured, there were many hairy obstacles we needed to tackle, some of which will be covered in the follow-up posts over the coming weeks. If this sounds interesting, follow us on Twitter to be notified when they are published.