Protobuf parsing in Python

2000 words by masci

This blog post was hosted in the Datadog Engineering Blog, you can
read it here.

Recently we extended the Datadog Agent to
support extracting additional metrics from Kubernetes using the
kube-state-metrics
service. Metrics are exported through an HTTP API that supports
content negotiation
so that one can choose between having the response body in plain text format or
as a binary stream encoded using Protocol buffers.

Binary formats are generally assumed to be faster and more efficient, but being
Datadog we wanted to see the data and quantify the improvement. We hope the
results documented here will help save you time and improve performance in your
own code. But before we dive into our findings, let’s start with Protocol
buffers 101.

A gentle introduction to Protocol buffers

Protocol Buffers is a way to
serialize structured data into a binary stream in a fast and efficient manner.
It is designed to be used for inter-machine communication and remote procedure
calls (RPC). This can be used in many different situations, including payloads
for HTTP APIs. To get started you need to learn a simple language that is used
to describe how your data is shaped, but once done a variety of programming
languages can be used to easily read and write Protobuf messages - let’s see a
simple example in Python.

Let’s say we want to provide a list of metrics through an HTTP endpoint: a
metric is something with a name, a string identifier for the type, a floating
point number holding the value and a list of tags, simple strings like
“env:prod” or “role:db” we can use later to filter, aggregate, and compare
results. We could simply print out metric values in plain text with a known
encoding, using some sort of field separator and a newline character to delimit
every entry, something we can implement with a simple format. With Protocol
buffers instead we need to save a text file (we can name it metric.proto) that
contains the following code:

Messages in the real world can be way more complex but for the scope of the
article we will try to keep things simple. You can dive deeper browsing the
official docs, namely the
language definition
and the Python tutorial.

As we mentioned, the .proto file alone is not enough to use the message, we
need some code representing the message itself in a programming language we can
use in our project. A tool called protoc (standing for Protocol buffers
compiler) is provided along with the libraries exactly for this purpose: given
a .proto file in input, it can generate code for messages in several different
languages.

We only need the Python code, so after installing protoc we would execute the
command: protoc --python_out=. metric.proto

The compiler should generate a Python module named metric_pb2.py that we can
import to serialize data:

Streaming Multiple Messages

That is neat but what if we want to encode/decode more than one metric from the
same binary file, or stream a sequence of metrics over a socket? We need a way
to delimit each message during the serialization process, so that processes at
the other end of the wire can determine which chunk of data contains a single
Protocol buffer message: at that point, the decoding part is trivial as we have
already seen.

Unfortunately, Protocol Buffers is not self-delimiting and even if this seems
like a pretty common use case, it’s not obvious how to chain multiple messages
of the same type, one after another, in a binary stream. The documentation
suggests prepending the message with its size to ease parsing, but the python
implementation does not provide any built in methods for doing this. The Java
implementation however offers methods such as parseDelimitedFrom and
writeDelimitedTo which make this process much simpler. To avoid reinventing
the wheel and for the sake of interoperability, let’s just translate the Java
library’s functionality to Python, writing out the size of the message right
before the message itself. This is also the way kube-state-metrics API chains
multiple messages.

This is quite easy to achieve, except that the Java implementation keeps the
size of the message in a Varint value. Varints are a serialization method that
stores integers in one or more bytes: the smaller the value, the fewer bytes you
need. Even if the concept is quite simple, the implementation in Python is not
trivial but stay with me, there is good news coming.

Protobuf messages are not self-delimited but some of the message fields are.
The idea is always the same: fields are preceded by a Varint containing their
size. That means that somewhere in the Python library there must be some code
that reads and writes Varints - that is what the google.protobuf.internal
package is for:

As you can see, _DecodeVarint32 is so kind to return the new position in the
buffer right after reading the Varint value, so we can easily slice and grab the
chunk containing the message.

At this point you may wondering why bother with Protocol buffers if we need to
introduce a compiler and write Python hacks to do something useful with that.
Let’s try to provide an answer,
measuring all the things.

Benchmarks anyone?

There are a number of reasons why people need to serialize data: sending
messages between two processes in the same machine, sending them across the
internet, or both - all of these use cases imply a very different set of
requirements. Protocol buffers are clever and efficient but some optimizations
and perks provided by the format are more visible when applied to certain data
formats, or in certains environments: whether this is the right tool or not,
that should be decided on a case by case basis.

Let’s start having a look at the size of the payload produced by encoding a
bunch of messages, both with Protocol buffers and using plain text. For simple
data like our Metric, binary encoding 100k messages takes 3.7Mb on disk that
shrink to about 200 Kb when gzip compressed. If we try to encode the same
message with a naive, streamable, newline-delimited text format like
sys.cpu,gauge,[my_tag,foo:bar],0.99\n that takes 3.4 Mb on disk and about
180 Kb once zipped to store the same 100k entries.

Results are similar with a real world example: a payload returned by the
kube-state-metrics API containing about 60 messages, encoded with Protocol
buffers and gzip compression, takes about 6Kb; the same payload encoded with
the Prometheus text format and gzip compression has pretty much the same size.

This is somehow expected since strings in Protobuf are utf-8 encoded and our
message is mostly text. If your data looks like our Metric message and you
can use compression, payload size should not be a criterion to choose Protocol
buffers over something else. Or to not choose it.

Speaking of performance, Protocol Buffers should be faster than several other
protocols but in practice, it shows how the Python implementation is extremely
slow. Because of that, pretty much everyone working with Python and Protobuf
uses an optimized version of the library, equipped with a C++ extension; the
current version is not pip installable but it’s fairly easy to build and install
and it provides a huge boost on the performance side.
These are the results using Python 3.5.1 with protobuf 3.1.0 on a MacBook Pro
(13-inch, Early 2015) to encode our Metric type, with and without Protobuf’
cpp extension, compared to the bogus text encoder from a few paragraphs ago.

The encoding phase is where Protobuf spends more time. The cpp encoder is not
bad, being able to serialize 10k messages in about 76ms while the pure Python
implementation takes almost half a second. For one million messages the pure
Python protobuf library takes about 40 seconds so it was removed from the chart.

Decoding results are better, with the cpp implementation able to deserialize
10k messages in less than 3ms; using the oversimplified plain text decoder is
like cheating so we don’t plot the values that would be close to zero.

Now that we have a rough idea of what we should expect from Protobuf in terms
of performance, let’s get back to our use case: should we use it to parse
kube-state-metrics? Or should we use the plain text format?

The (short) road to Protobuf

After playing a bit with Protobuf, it wasn’t obvious whether to choose it or
not to implement the kubernetes_state check. We had to introduce about 5Mb in
binary dependencies to the agent (at this point it should be clear that using
the pure Python version of the library is futile and the C++ runtime does not
come for free) and the benchmarks seemed to show moderate gains over processing
plain text. But trying to look at the entire picture makes things way more clear.

A Python version of the protocol used by the API is publicly available and can
be included in our codebase so we can ignore the overhead in terms of setup and
tooling caused by the compilation process.

In terms of payload size, being the API server capable to compress HTTP
responses, we can choose any of the two formats (Protobuf and plain text)
provided by the kube-state-metrics API without changing the overall performance
of the check.

It’s important to note we access data that is already serialized, meaning that
encoding performance is someone else’s problem; we only need to focus on what’s
the impact of decoding messages on the overall speed of our check. These are the
results of processing 60 metrics from the kube-state-metrics API in both plain
text and protobuf. The text was processed with the Python parser implemented in
the Prometheus client library.

As you can see, parsing complex data in text format is very different from our
simple metric message: the prometheus parser has to deal with multiple lines,
comments, and nested messages. Putting aside the pure Python version of the
library, parsing the binary payload outperforms plain text by one order of
magnitude, crunching 60 metrics in about 2 milliseconds - the payload transfer
will always be the bottleneck, no matter what.

Even if the Protobuf Python library does not support chained messages out of
the box, the code needed to implement this feature is less than 10 lines. The
data we get is structured, so once the message is parsed we are sure that a
field contains what we expect, meaning less code, less tests, less hacks, less
odds of a regression if the API changes.