OpenCensus for Go gRPC developers

In this tutorial, we’ll examine how to use OpenCensus in your gRPC projects in the Go programming language to gain observability into both your server and your client. We’ll then examine how to integrate with OpenCensus exporters for AWS X-Ray, Prometheus, Zipkin and Google Stackdriver Tracing and Monitoring.

gRPC is a modern, high performance framework for remote procedure calls, powered by Protocol Buffer encoding. It is polyglot in nature and usable in a variety of environments, ranging from mobile devices and general purpose computers to data centres for distributed computing. It is implemented in a variety of languages: Go, Java, Python, C/C++, Node.js, Ruby and PHP. See https://grpc.io/

OpenCensus is a modern observability framework for distributed tracing and monitoring across microservices and monoliths alike. It is polyglot in nature too, and usable in a variety of environments, ranging from mobile devices and general purpose computers to data centres for distributed computing. It is implemented in a plethora of languages: Go, Java, Python, C++, Node.js, Ruby, PHP and C# (coming soon). See https://opencensus.io/

Go is a modern programming language that powers the cloud as well as modern systems programming, making it easy to build simple, reliable and efficient software. It is a cross platform, fast, statically typed and simple language. See https://golang.org

With the above three introductory paragraphs, perhaps you have already noticed the common themes: high performance, distributed computing, modern nature, cross platform support, simplicity and reliability. Those shared traits make the three a natural match, hence the motivation for this tutorial.

For this tutorial, we have a company’s service that’s in charge of capitalizing letters sent in from various clients and internal microservices using gRPC.

To use gRPC, we first need to create Protocol Buffer definitions and then run the Protocol Buffer compiler with the gRPC plugin on them to generate code stubs. If you need to look at the prerequisites or a primer on gRPC, please check out https://grpc.io/docs/

Our service takes in a payload with bytes, and then capitalizes them on the server.

Payload Message and Fetch service

To generate code, we’ll first put our definition in a file called “defs.proto” inside our “rpc” directory, and then run make with the Makefile below to generate the gRPC code stubs in Go:

Makefile

Running make should then generate code, leaving the directory structure looking like this:

After the code generation, we now need to add the business logic to the server.

Plain Server

Our server’s sole purpose is to capitalize content sent in and send it back to the client. With gRPC, as previously mentioned, the protoc plugin generated code for a server interface. This allows you to create your own custom logic of operation, as we shall do below with a custom object that implements the Capitalize method.
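As a rough sketch of that business logic, the custom object could look something like the code below. It assumes a hand-written Payload struct standing in for the type that protoc would generate from defs.proto; the real generated type, the generated server interface and the registration call are not reproduced here, and capitalizeString is a hypothetical helper added only so the handler can be tried without a network connection.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
)

// Payload is a hypothetical stand-in for the message type that protoc
// would generate from defs.proto.
type Payload struct {
	Data []byte
}

// fetchIt is the custom object that carries our business logic.
type fetchIt struct{}

// Capitalize upper-cases the bytes it receives and returns them in a new
// Payload, mirroring the shape of a unary gRPC handler.
func (fi *fetchIt) Capitalize(ctx context.Context, in *Payload) (*Payload, error) {
	return &Payload{Data: bytes.ToUpper(in.Data)}, nil
}

// capitalizeString is a convenience wrapper (an assumption of this
// sketch) for calling the handler directly, without a gRPC server.
func capitalizeString(s string) (string, error) {
	out, err := new(fetchIt).Capitalize(context.Background(), &Payload{Data: []byte(s)})
	if err != nil {
		return "", err
	}
	return string(out.Data), nil
}

func main() {
	out, err := capitalizeString("hello, gRPC")
	if err != nil {
		log.Fatalf("Capitalize: %v", err)
	}
	fmt.Println(out)
}
```

In a real server, the fetchIt value would be handed to the registration function from the generated code and served over a network listener.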

server.go

With that, we can now monetize access and make money $$$. To accomplish that, though, we need clients that speak gRPC, and for that please see below:

Plain Client

Our client makes a request to the gRPC server above, sending content that then gets capitalized and printed to our screen. It is interactive and can be run simply with go run client.go
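The read, send and print loop of such a client might be sketched as follows. This is not the client.go from the article: the capitalizeFunc type is a hypothetical seam standing in for the generated gRPC client's Capitalize call, and localCapitalize replaces the remote call so that the sketch runs without a server.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"
)

// capitalizeFunc is a hypothetical seam standing in for the generated
// gRPC client's Capitalize call.
type capitalizeFunc func(in []byte) ([]byte, error)

// repl reads one line at a time from in, sends it through capitalize and
// prints each response on its own line to out, until in is exhausted.
func repl(in io.Reader, out io.Writer, capitalize capitalizeFunc) error {
	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		resp, err := capitalize(scanner.Bytes())
		if err != nil {
			return err
		}
		if _, err := fmt.Fprintf(out, "%s\n", resp); err != nil {
			return err
		}
	}
	return scanner.Err()
}

// localCapitalize replaces the remote call so the sketch runs offline.
func localCapitalize(in []byte) ([]byte, error) {
	return bytes.ToUpper(in), nil
}

// session feeds canned input through repl and returns what was printed.
func session(input string) (string, error) {
	var out strings.Builder
	err := repl(strings.NewReader(input), &out, localCapitalize)
	return out.String(), err
}

func main() {
	// Passing os.Stdin and os.Stdout to repl instead would make this
	// interactive.
	out, err := session("hello\nworld\n")
	if err != nil {
		log.Fatalf("session: %v", err)
	}
	fmt.Print(out)
}
```

Keeping the remote call behind a function value means the loop can be exercised with canned input before a server exists.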

When run interactively, it will look like this:

interactive response from the client

And now that we have a client, we are open for business!!

Made that money

Aftermath

It’s been 1 hour since launch. Tech blogs and other programmers are sharing news of our service all over the internet and social media, and it is being talked about all around the business world too. High fives and congratulations are shared, and after the celebration we all go home and call it a night. It’s the latest and greatest API in the world, it is off the charts and customers from all over the world are coming in. What could go wrong?

It hits 3AM and our servers start getting overloaded. Response time degrades for everyone. This, however, is only noticed when one of the engineers tries to give a demo to their family, whom they had restlessly woken up at 2:00AM out of excitement, and the service takes 15ms to give back a response. In normal usage, we saw response times of at most about 1ms. What is causing the sluggishness of the system? When did our service start getting slow? What is the solution? Throw more servers at it? How many servers should we throw at it? How do we know what is going wrong? When? How can the engineering and business teams figure out what to optimize or budget for? How can we tell we’ve successfully optimized the system and removed bottlenecks?

Flying high until our servers came crashing down!

In comes OpenCensus: OpenCensus is a single distribution of libraries for distributed tracing and monitoring for modern and distributed systems. OpenCensus can help answer most of the questions that we just asked. By “most”, I mean that it can answer the observability related questions such as: When did the latency increase? Why? How did it increase? By how much? What part of the system is the slowest? How can we optimize and assert successful changes?

OpenCensus is simple to integrate and use, adds very little latency overhead to your applications, and is already integrated into both gRPC and HTTP transports.

OpenCensus allows you to trace and measure once and then export to a variety of backends like Prometheus, AWS X-Ray, Stackdriver Tracing and Monitoring, Jaeger, Zipkin etc. With that mentioned, let’s get started.

Part 1: observability by instrumenting the server

To collect statistics from gRPC servers, we can rely on OpenCensus already being integrated with gRPC out of the box: one just has to import go.opencensus.io/plugin/ocgrpc and then subscribe to (register) the gRPC server views. This amounts to a 7 line change.

Then, to trace the application, we’ll start a span on entering the function and end it on exiting. This amounts to a 7 line change too.

In the tracing, notice the trace.StartSpan(ctx, "(*fetchIt).Capitalize") call? We take a context.Context as the first argument in order to use context propagation, which carries RPC specific information about a request along with it so that the request can be uniquely identified.

How do we examine that “observability”?

Now that we’ve got tracing and monitoring in, let’s export that data out. Earlier on, I made claims that with OpenCensus you collect and trace once, then export to a variety of backends, simultaneously. Well, it is time for me to walk that talk!

To do that, we’ll need to use the exporter integrations in our app to send data to our favorite backends: AWS X-Ray, Prometheus, and Stackdriver Tracing and Monitoring.

Together, these changes finally give this code:

OpenCensus instrumented server.go

We also need the following variables set in our environment:

AWS_REGION=region

AWS_ACCESS_KEY_ID=keyID

AWS_SECRET_ACCESS_KEY=key

GOOGLE_APPLICATION_CREDENTIALS=credentials.json

as well as our prometheus.yml file, after which we start Prometheus and then the server:

prometheus --config.file=prometheus.yml

go run server.go

2018/05/12 11:40:17 fetchIt gRPC server serving at ":9988"

Monitoring results

Prometheus latency bucket examinations

Prometheus completed_rpcs examination

Prometheus sent_bytes_per_rpc_bucket examination

Stackdriver Monitoring completed_rpcs examination

Stackdriver Monitoring server_latency examination

Tracing results

Common case: low latency on the server

Postulation: pathological case of inbound network congestion

Postulation: pathological case of outbound network congestion

Stackdriver Trace — common case, fast response, low latency

Postulation: system overload on server hence long time for bytes.ToUpper to return

Postulation: outbound network congestion

Postulation: inbound network congestion

Part 2: observability by instrumenting the client

For client monitoring, we’ll do the same thing with the gRPC stats handler, except using the ClientHandler, and then also start and end a span, and that’s it, collectively giving this diff below

The client in full now becomes this code:

which gives this visualization

Engineers can add alerts with Prometheus https://prometheus.io/docs/alerting/overview/ or Stackdriver Monitoring https://cloud.google.com/monitoring/alerts/, and the various teams can also examine system behaviour simultaneously, be it traces or metrics, on a variety of backends. A question one might have is: “how about observability for streaming?” For streaming you can use the same logic, with one caveat: in order to export a trace, the span needs to have been ended, but with streaming you have a single persistent connection that is perhaps open indefinitely. What you can do is register unique identifying information from a streaming request and then, per stream response, start and end a span!