Istio, Envoy and Honeycomb

Here at the hive, we’re exceedingly excited about the emerging future of the “service mesh”. Deploy a sidecar proxy such as Envoy in your infrastructure, and you get consistent support for advanced traffic control, fault injection, request-level observability, and other powerful features for every service. That’s a mighty useful tool to have when operating distributed systems.

In particular, this technology makes canary deployments and incremental rollouts dramatically simpler — particularly when used with a control plane such as Istio.
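To make that concrete, here is a minimal sketch of what an Istio traffic split for a canary might look like: a VirtualService that sends a small slice of traffic to a canary subset. The service name, subset names, and weights are illustrative placeholders, and the exact API version depends on your Istio release.

```yaml
# Illustrative only: routes 90% of traffic to the stable subset and 10% to the
# canary. The host, subset names, and weights are placeholders; adjust them to
# match your own DestinationRule and Istio version.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookapp
spec:
  hosts:
    - bookapp
  http:
    - route:
        - destination:
            host: bookapp
            subset: v1
          weight: 90
        - destination:
            host: bookapp
            subset: v2
          weight: 10
```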

But the whole point of a canary is lost if you can’t usefully observe it! In order to deploy with confidence, you need to know that your canary is receiving the fraction of traffic that you expect, that you’re seeing the changes you want to see, and that you’re not seeing changes you *don’t* want.

Now, you can certainly publish and alert on key metrics — request rate, latency, error rate. But it’s tough to predict every metric you might ever want, and emit them all separately. Especially when you’re making very focused changes. Change the implementation of a REST API endpoint, and you’ll want to see the effect on that endpoint. Fix a customer’s issue, and you’ll want to see the effect for that customer.

Fortunately, you can use Honeycomb to parse Envoy access logs, and slice, dice, or julienne the events that they represent to get the numbers you care about.

How does that work in practice? There are different ways to deploy Envoy, but in this post, we’ll focus on the case where you’re running Istio in a Kubernetes cluster. The Envoy container is deployed as a sidecar alongside your application containers. By deploying the Honeycomb Kubernetes agent and instructing it to parse Envoy access logs, you get immediate, cluster-wide, language-agnostic visibility into every request that’s served.
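As a rough sketch of what that setup could look like, the agent is typically configured via a ConfigMap that tells it which containers’ logs to watch and how to parse them. The label selector, container name, parser, and dataset below are assumptions for illustration; check the honeycomb-kubernetes-agent documentation for the exact schema and supported parser names.

```yaml
# Hypothetical watcher config for the honeycomb-kubernetes-agent.
# labelSelector, parser, and dataset values are placeholders; consult the
# agent docs for the exact fields and supported parsers.
apiVersion: v1
kind: ConfigMap
metadata:
  name: honeycomb-agent-config
  namespace: kube-system
data:
  config.yaml: |
    watchers:
      - labelSelector: "app=bookapp"   # match pods running the Envoy sidecar
        containerName: istio-proxy     # the Envoy sidecar injected by Istio
        parser: envoy                  # parse Envoy access log lines into structured events
        dataset: envoy-canary          # Honeycomb dataset to send events to
```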

By breaking down our traffic by the version label, we can check that the canary deployment is receiving an appropriate amount of traffic. The Honeycomb agent automatically augments Envoy logs with Kubernetes metadata, so we can easily break down our request events by pod labels, pod UID, node, and so on without any application code changes.

To verify that the canary is healthy, we can look at overall response time:

Aha! While median response time is essentially unchanged, it looks like there’s a regression in tail latency in this deployment. What now? Well, as an operator, you might choose to roll back the deployment. But the ability to break down latency and error rates by endpoint, user ID, or any other criterion you like makes identifying the source of the regression much faster. In this case, let’s restrict our query to slow requests by filtering on duration > 220, and break down those slow requests by API endpoint in addition to app version. Now we can see that the regression specifically affects the /books endpoint in version 2 of the app, but not other endpoints.

In Conclusion

We’ve barely scratched the surface here. If you’re running Envoy (or a more traditional proxy such as HAProxy or NGINX), this level of visibility into individual requests is pretty much the bare minimum. But there’s a lot more you could be doing, including passing application-specific headers back into the access log, using Envoy’s request_id to follow requests across services, and more. We’ll cover that in future posts — in the meantime, don’t hesitate to get in touch if you’d like to learn more!