Istio in Production?

Istio is one of the most popular service meshes. It can help solve many of the issues that surface when running a lot of microservices: things like authentication, authorization, observability and traffic routing. It all sounds really promising, so we decided to give it a try at Soluto. While deploying it on an existing cluster and enabling it on existing workloads, I ran into a lot of interesting issues. Let me share some of them with you.

A production deployment?

Istio is a really complex product. It has quite a few moving parts, all of which are required for a functional deployment. This diagram shows all the Istio components; you can learn more about them in the official docs:
Istio Components Diagram – source: Istio website
Which got me wondering: what do I need to monitor? Which of those components are critical, and which are not? Do we need to wake up in the middle of the night if Mixer is down? Unfortunately, I couldn’t find clear answers to these questions. Istio has really good Grafana dashboards for each of those components. Using the dashboards, you can find important metrics to monitor, like Pilot push errors. Is that enough? I don’t know. Besides monitoring, it’s also important to ensure that all those components are highly available.

In addition, Istio depends on other tools like Grafana, Prometheus, Jaeger, and Kiali. Istio can install all of them, but only for demo purposes. Running them in production is feasible only if you install them yourself, with Helm or the relevant operator. And at that point, you also have to solve issues like Kiali authentication or Jaeger storage.

A safe rollout

After installing Istio, the next step is to start rolling it out onto the existing workloads. Istio is enabled on a workload by enabling sidecar injection for it. The injector uses namespace labels to detect the relevant workloads (see injection rules here). So I decided to enable it on one namespace and see what would happen. As it turned out, things did not end well:
Number of pods over time
What happened? I deleted all the pods in the namespace (never do that!), so all of them got re-created. The namespace was labeled, so the sidecar was injected into the new pods. And for some reason, the proxy used a lot of CPU, more than it requested. The HPA noticed that and scaled up the pods. That did not help, so it kept scaling them up. The solution was to define larger requests for the proxy using annotations on the pods.
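To make the feedback loop concrete, here is a minimal Python sketch of the scaling rule from the Kubernetes HPA documentation, `ceil(currentReplicas * currentUtilization / targetUtilization)`, with its default 10% tolerance. Everything else is an assumption for illustration, not data from our cluster: the millicore figures, the 70% target, the replica cap, and above all the simplification that the sidecar burns a fixed amount of CPU per pod no matter how many replicas exist (which would explain why adding pods did not help).

```python
import math


def hpa_desired_replicas(current_replicas, utilization, target_utilization, tolerance=0.1):
    """HPA rule: ceil(current * utilization / target), skipped inside the tolerance band."""
    ratio = utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def simulate(sidecar_request_m, *, app_load_m=600, app_request_m=200,
             sidecar_usage_m=800, target_utilization=0.7,
             replicas=3, max_replicas=50, rounds=10):
    """Return the replica count after each HPA decision (all values hypothetical).

    The app's total load is split across replicas, while the sidecar is
    assumed to use a fixed amount of CPU in every pod.
    """
    history = [replicas]
    for _ in range(rounds):
        # Pod-level CPU utilization: usage of all containers over their requests.
        usage_m = app_load_m / replicas + sidecar_usage_m
        request_m = app_request_m + sidecar_request_m
        utilization = usage_m / request_m

        desired = hpa_desired_replicas(replicas, utilization, target_utilization)
        desired = min(max_replicas, desired)
        if desired == replicas:
            break  # the HPA is satisfied (or stuck at the cap)
        replicas = desired
        history.append(replicas)
    return history


if __name__ == "__main__":
    print("sidecar request 100m: ", simulate(100))
    print("sidecar request 1000m:", simulate(1000))
```

With a small sidecar request the utilization never drops below the target, so the replica count runs away until it hits the cap; with a request sized closer to the real usage it settles after a couple of steps. Istio documents pod annotations such as `sidecar.istio.io/proxyCPU` for overriding the proxy's CPU request; the right value has to come from measuring your own workloads.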

The Hidden Cost

Istio is open source, but still, you’re going to pay for using it. Why? Because it adds another sidecar (the Envoy proxy) to every pod in your clusters. This was not something I had thought about before starting to play with Istio. So the question is: how much is it going to cost?
This really depends on how many resources Istio consumes. I noticed that on a few services it consumes less than 10m CPU (millicores), but on others it can take up to 800m CPU (see the previous section). Why? Not sure. Istio has a pretty good page on performance, but it did not match what we experienced (see the issue here):
Istio Container CPU usage by pod name
See the highest bars? These are the containers on the Grafana pod, which gets almost no traffic. So to answer the cost question: it really depends, and the only way to know for sure is by trying. One thing to note: the default CPU request for the proxy is 100m CPU, which is a lot more than it needs on many services. Setting it to a lower value can save you a lot of money.
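A rough way to put a number on this is to multiply the sidecar's CPU request by the number of pods and by what a vCPU costs you. The Python sketch below does only that arithmetic. The pod count, the price per vCPU-hour and the lower 10m request are hypothetical placeholders, and it assumes your cluster is sized by requests, so reserved CPU translates directly into nodes you pay for.

```python
def sidecar_request_cost(pods, request_millicores, price_per_vcpu_hour, hours=730):
    """Monthly cost of the CPU reserved by sidecar requests.

    pods: number of pods that get a sidecar injected
    request_millicores: CPU request of each sidecar, in millicores
    price_per_vcpu_hour: what one vCPU costs per hour (hypothetical input)
    hours: hours per month (730 is a common approximation)
    """
    reserved_cores = pods * request_millicores / 1000
    return reserved_cores * price_per_vcpu_hour * hours


if __name__ == "__main__":
    pods = 1000      # hypothetical cluster size
    price = 0.04     # hypothetical price per vCPU-hour

    default_cost = sidecar_request_cost(pods, 100, price)  # 100m default request
    tuned_cost = sidecar_request_cost(pods, 10, price)      # hypothetical lower request

    print(f"default request: {default_cost:.2f} per month")
    print(f"tuned request:   {tuned_cost:.2f} per month")
    print(f"difference:      {default_cost - tuned_cost:.2f} per month")
```

This covers only the reserved CPU. Memory requests and the proxies that actually use far more than they request (like the ones in the chart above) need to be accounted for separately.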

Tracing is not trivial. Even with Istio, you still need to propagate the tracing headers between services, which means you still need to instrument your code. Istio just makes that instrumentation a bit easier. Also, if you’re using the Nginx ingress, you might want to enable its tracing module.
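Header propagation itself is simple; the work is in doing it on every outgoing call. As a minimal sketch, the Python helper below copies the B3-style headers listed in Istio's distributed tracing documentation from an incoming request onto an outgoing one. The function name and the plain-dict interface are assumptions for illustration; in a real service you would hook this into your HTTP framework and client.

```python
# Headers that Istio's distributed tracing docs ask applications to forward
# (B3 / Zipkin style). Newer setups may need additional headers.
TRACING_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def tracing_headers_to_forward(incoming_headers):
    """Return the subset of incoming headers that must be copied to outgoing calls.

    HTTP header names are case-insensitive, so names are compared in lowercase.
    """
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    return {name: lowered[name] for name in TRACING_HEADERS if name in lowered}


if __name__ == "__main__":
    incoming = {
        "X-Request-Id": "abc-123",
        "X-B3-TraceId": "463ac35c9f6413ad",
        "X-B3-SpanId": "a2fb4a1d1a96d312",
        "Content-Type": "application/json",
    }
    # Merge the result into the headers of every outgoing request
    # made while handling this incoming request.
    print(tracing_headers_to_forward(incoming))
```

Without this step each hop shows up as a separate, unrelated trace, because the sidecar cannot tell which outgoing call belongs to which incoming request.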

Wrapping Up

Istio is a very powerful tool, but deploying it into an existing cluster is not trivial at all. If you do decide to follow this route, remember that:

This is going to be a very long journey. A safe rollout is slow and manual: it requires a very careful restart of all the workloads on the cluster and, in some cases, fine-tuning.

Istio has its own price. Prepare to see an increase in resource usage!

Production deployment is not trivial. I touched on only some of the issues here, and things get a lot more complex when you have multiple clusters.

Looking forward, I do hope to solve those issues and be able to leverage a service mesh in production soon. Maybe trying AWS App Mesh or Linkerd could help with some of them. As an alternative, using an API gateway (like Gloo or Kong) could give us some of the value at a much lower cost. Where are we heading? It’s still too early to say 🙂