THE SALSIFY SMARTER ENGINEERING BLOG

Adventures in Avro

As part of our microservices architecture we recently adopted Avro as a data serialization format. In the process of incorporating Avro we created a Ruby DSL for defining Avro schemas, developed a gem for generating models from Avro schemas, and reimplemented an Avro schema registry in Rails. Here’s how we got there ...

Here’s a situation that may be familiar: at Salsify we are moving towards a microservices architecture. Have you done that too? Are you thinking about doing it? This is a pretty common progression for startups that built a monolithic application first, found great market fit, and then needed to scale both the application and the team. At Salsify, we already have quite a few services running outside of the original monolith, but several months ago we started to define an architecture for how we should chip away at the monolith and structure new core services.

Naturally, part of the architecture we are defining is how services should communicate with each other. For synchronous communication, we decided to stick with HTTP REST APIs that speak JSON. For asynchronous communication, we selected Apache Kafka.

We evaluated several data serialization formats to use with Kafka. The main contenders were Apache Avro, JSON, Protocol Buffers and Apache Thrift. For asynchronous communication we wanted a data format that was more compact and faster to process than the JSON we use for synchronous calls. Asynchronous data may stick around longer and the same message may be processed by multiple systems, so version handling was important. A serialization system should also provide additional benefits like validation and strong typing.

At this point, I should inject that we’re primarily a Ruby shop. We love our expressive, dynamic language of choice, so a big factor in the selection process was how well the framework integrates with Ruby. Based on the title of this post, it's not going to be any surprise which option was the winner ...

Evaluating Avro

We chose to use Avro for data serialization. There were obviously many things that we liked about Avro, but also some shortcomings that we knew we'd have to work around.

On the positive side for Avro, it is used by Confluent, the Kafka people, for their Schema Registry (more on that later). Also, Avro supports dynamic clients in languages like Ruby, whereas Apache Thrift and Protocol Buffers require code generation using framework-supplied utilities.

Another attractive feature of Avro is the concept of a canonical form for a schema and a corresponding fingerprint that can be calculated for the schema to uniquely identify it. Avro also supports compatibility checks between schema versions to help ensure that breaking changes are not introduced for a message format. Lastly, Avro has good adoption in the analytics community, which is a growth area of work for us this year.

Now for the downsides that we saw. The Avro support for Ruby is not as complete as for other languages. There are no model objects created to represent the data for a schema, so you are just working with hashes in Ruby. There is no import or include functionality (without the IDL) to break schema definitions across multiple files.

After picking Avro, we were ready to take advantage of the things that we liked and to set about fixing some of these shortcomings we'd identified.

Defining Schemas

Avro schemas are defined in JSON. There is also an interface definition language (IDL) that can be used to define protocols for Avro RPC.

The Avro IDL looked preferable to the Avro JSON (we’ll come back to that) and it supports imports to reuse definitions. The Avro IDL is used to define protocols, but schema definitions can be extracted from a protocol file. That extraction requires using the avro-tools jar, but since this utility was only required during schema definition we would have been willing to compromise on it. The breaking point for using the IDL was that the import paths must be relative. This doesn’t work well for sharing definitions across multiple projects.

As Rubyists we love our DSLs. So instead of writing Avro JSON schemas directly we decided to write avro-builder, a DSL for defining Avro schemas in Ruby that provides support for the things we wanted and is more pleasant to write.

avro-builder

Here’s an example of an Avro schema in JSON from the Avro specification:

This defines a record named LongList. The record has an alias LinkedLongs that provides compatibility with versions of the schema that use that name. The record being defined here has two fields: value and next. The next field is an example of a union in Avro. This example of a union with null is how an optional field is represented in Avro.
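For reference, the LongList schema described above appears in the Avro specification as the JSON embedded here. It is wrapped in a Ruby heredoc only so that it can be parsed and inspected; the constant and helper names are ours, for illustration.

```ruby
require 'json'

# The LongList example from the Avro specification.
LONG_LIST_JSON = <<~JSON
  {
    "type": "record",
    "name": "LongList",
    "aliases": ["LinkedLongs"],
    "fields": [
      {"name": "value", "type": "long"},
      {"name": "next", "type": ["null", "LongList"]}
    ]
  }
JSON

# An optional field is one whose type is a union that includes "null".
def optional_field_names(schema)
  schema['fields']
    .select { |f| f['type'].is_a?(Array) && f['type'].include?('null') }
    .map { |f| f['name'] }
end

puts optional_field_names(JSON.parse(LONG_LIST_JSON)).inspect # → ["next"]
```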

Lots of boilerplate and strange conventions, right? Here’s the same definition using avro-builder:

We chose to implement field definition in the DSL using required and optional keywords. Each field takes a name and a type argument, and accepts a number of other options that depend on the type.
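To illustrate how required and optional keywords can map onto Avro fields, here is a toy stand-in written with only the standard library. It is not avro-builder's implementation: ToyRecord and toy_record are hypothetical names, and defaulting optional fields to null is our assumption.

```ruby
require 'json'

# Toy stand-in for a schema DSL; NOT avro-builder itself.
class ToyRecord
  def initialize(name)
    @schema = { 'type' => 'record', 'name' => name.to_s, 'fields' => [] }
  end

  # A required field maps straight onto an Avro field of the given type.
  def required(name, type)
    @schema['fields'] << { 'name' => name.to_s, 'type' => type.to_s }
  end

  # An optional field becomes a union with null (assumed to default to null).
  def optional(name, type)
    @schema['fields'] << { 'name' => name.to_s,
                           'type' => ['null', type.to_s],
                           'default' => nil }
  end

  def to_h
    @schema
  end
end

# Hypothetical entry point: evaluates the block against a ToyRecord.
def toy_record(name, &block)
  record = ToyRecord.new(name)
  record.instance_eval(&block)
  record.to_h
end

LONG_LIST = toy_record(:LongList) do
  required :value, :long
  optional :next, :LongList
end

puts JSON.pretty_generate(LONG_LIST)
```

The point of the sketch is that the union-with-null convention is written once, inside optional, rather than in every schema.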

In avro-builder, we also implemented support for splitting definitions across files. The Avro::Builder module can be configured with load paths that will be searched for a DSL file with a name that matches the reference: Avro::Builder.add_load_path('/path/to/dsl/files').
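The load path idea can be sketched roughly as follows. ToyDslResolver is a hypothetical class, and matching on a bare file name is a simplifying assumption; the real lookup in avro-builder also has to deal with namespaces.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical resolver: searches each load path for "<name>.rb".
class ToyDslResolver
  def initialize
    @load_paths = []
  end

  def add_load_path(path)
    @load_paths << path
  end

  # Returns the first matching file, or raises if the name is unknown.
  def resolve(name)
    @load_paths.each do |dir|
      matches = Dir.glob(File.join(dir, '**', "#{name}.rb"))
      return matches.first unless matches.empty?
    end
    raise ArgumentError, "no DSL file found for #{name}"
  end
end

# Builds a throwaway directory with one DSL file and resolves it by name.
def demo_resolve
  Dir.mktmpdir do |dir|
    FileUtils.mkdir_p(File.join(dir, 'shared'))
    File.write(File.join(dir, 'shared', 'address.rb'), "# DSL definition goes here\n")
    resolver = ToyDslResolver.new
    resolver.add_load_path(dir)
    File.basename(resolver.resolve(:address))
  end
end

puts demo_resolve # → address.rb
```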

The following example shows a record that would be defined in one file and then referenced by name in a schema definition in another file:

With avro-builder we have a more ergonomic way to define Avro schemas and share definitions across projects. We do this by storing definitions in gems (as Ruby files in the avro-builder DSL), and referencing those shared definitions via paths within the gem. Within a top-level avro directory in a project we typically create dsl and schema subdirectories for DSL files and generated schemas. To share DSL files from a gem we either define a constant that is set to provide the gem's location or use Gem::Specification:
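As a rough sketch of the Gem::Specification approach, the helper below finds an installed gem and returns its avro/dsl directory. The gem name my_shared_schemas and the helper shared_dsl_path are hypothetical; the avro/dsl layout follows the convention described above.

```ruby
require 'rubygems'

# Returns the avro/dsl directory inside an installed gem,
# or nil when the gem is not installed.
def shared_dsl_path(gem_name)
  spec = Gem::Specification.find_by_name(gem_name)
  File.join(spec.gem_dir, 'avro', 'dsl')
rescue Gem::LoadError
  nil
end

path = shared_dsl_path('my_shared_schemas') # hypothetical gem name
puts(path || 'gem not installed')
```

The resulting path is what would be handed to Avro::Builder.add_load_path.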

Since Avro does not require code generation with dynamically typed languages like Ruby, we even have the option of using avro-builder at runtime to generate a schema that can then be parsed and used by the avro gem for Ruby. In practice we haven’t gone this route except for temporary schemas in tests. In applications, we prefer to generate the Avro JSON schema files (.avsc) from the DSL and commit those artifacts as well as the DSL files.

The Schema Registry

Unique to Avro, among the serialization frameworks we considered, is the requirement that the schema must always be present when reading serialized data. Avro fields are identified by name rather than user-assigned ids. Fields are serialized in the order that they appear in the schema, so the field names do not need to appear in the serialized data. Since Avro does not rely on generated code, the schema is required to interpret the encoded data. In fact, when decoding data both a reader's and a writer's schema can be supplied to handle mapping between different versions.
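To make the point about field order concrete, here is a minimal sketch of Avro's binary encoding for a record, covering only the long and string types. The field list and helper names are made up for illustration; in practice the avro gem does this work.

```ruby
# Zig-zag plus variable-length encoding, as Avro uses for int and long.
def encode_long(n)
  z = (n << 1) ^ (n >> 63)
  bytes = []
  while z > 0x7F
    bytes << ((z & 0x7F) | 0x80)
    z >>= 7
  end
  bytes << z
  bytes.pack('C*')
end

# A string is its byte length (as a long) followed by its UTF-8 bytes.
def encode_string(s)
  data = s.encode('UTF-8').b
  encode_long(data.bytesize) + data
end

# Values are written in schema order; field names never reach the output.
def encode_record(fields, datum)
  fields.map do |field|
    value = datum.fetch(field[:name])
    case field[:type]
    when 'long' then encode_long(value)
    when 'string' then encode_string(value)
    else raise ArgumentError, "unsupported type #{field[:type]}"
    end
  end.join
end

USER_FIELDS = [{ name: 'id', type: 'long' }, { name: 'name', type: 'string' }].freeze

encoded = encode_record(USER_FIELDS, { 'id' => 1, 'name' => 'foo' })
puts encoded.unpack1('H*') # → 0206666f6f
```

Five bytes carry both values, and nothing in them says which byte belongs to which field. That is why a reader cannot make sense of the data without the schema.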

Since the Avro schema always has to be present for reading and writing we really like the idea of the Confluent Schema Registry. The schema registry stores different versions of a schema under a subject. The subject is just an arbitrary name used in the registry to identify a schema and we decided to use the schema's full name, including namespace, for this value. Each version of the schema is assigned an integer id. Attempting to re-register the same schema returns the same id, and the id can be used to retrieve the schema from the server.
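The registration behaviour can be sketched with an in-memory stand-in. ToySchemaRegistry is hypothetical, and comparing schemas by a JSON round-trip is a simplification of how a real registry decides that two schemas are the same.

```ruby
require 'json'

# In-memory stand-in for a schema registry; not the Confluent implementation.
class ToySchemaRegistry
  def initialize
    @ids_by_schema = {}
    @schemas_by_id = {}
    @ids_by_subject = Hash.new { |hash, key| hash[key] = [] }
  end

  # Registers a schema under a subject and returns its integer id.
  # Re-registering the same schema returns the id it already has.
  def register(subject, schema_json)
    normalized = JSON.generate(JSON.parse(schema_json))
    id = (@ids_by_schema[normalized] ||= @ids_by_schema.size + 1)
    @schemas_by_id[id] = normalized
    ids = @ids_by_subject[subject]
    ids << id unless ids.include?(id)
    id
  end

  # Retrieves the schema JSON for an id.
  def fetch(id)
    @schemas_by_id.fetch(id)
  end

  # Number of distinct schema versions stored under a subject.
  def version_count(subject)
    @ids_by_subject[subject].length
  end
end

registry = ToySchemaRegistry.new
first = registry.register('com.example.user', '{"type": "string"}')
again = registry.register('com.example.user', '{"type":"string"}')
puts first == again # → true
```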

Using a schema registry, only the registry-assigned id for the schema needs to be sent along with Avro-encoded data instead of the full JSON for the schema. Then when the data needs to be decoded the original schema can be retrieved by id from the registry.
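One way to attach the id is to prefix the payload with it. The sketch below assumes the layout used by the Confluent wire format as we understand it: a zero magic byte, then the schema id as a 4-byte big-endian integer, then the Avro bytes. The frame and unframe helpers are hypothetical names.

```ruby
MAGIC_BYTE = 0

# Prefixes Avro-encoded bytes with the magic byte and the schema id.
def frame(schema_id, avro_bytes)
  [MAGIC_BYTE, schema_id].pack('CN') + avro_bytes.b
end

# Splits a framed message into [schema_id, avro_bytes].
def unframe(message)
  magic, schema_id = message.unpack('CN')
  raise ArgumentError, 'unknown magic byte' unless magic == MAGIC_BYTE
  [schema_id, message.byteslice(5..-1)]
end

framed = frame(42, "\x02".b)
puts framed.bytes.inspect # → [0, 0, 0, 0, 42, 2]
```

The overhead is five bytes per message, regardless of how large the schema is.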

Although we liked the idea of the schema registry, there were parts of the implementation that we were less enthusiastic about. The Confluent Schema Registry stores all of the schema versions in Kafka. We think of our messages in Kafka as more transient, but our schemas are something that we hope to keep around for a long time. We wanted to retain some flexibility in how we host Kafka. We might have multiple Kafka clusters, but we want a single schema registry. And backup tools are just more mature for relational databases at this point. For all these reasons, we decided to reimplement the API of the schema registry as a Rails application and to use a relational database (Postgres) for persistence.

The avro-schema-registry implements the same API. We host an instance of this application publicly at avro-schema-registry.salsify.com that anyone can experiment with; just please don't rely on it for production! There is also a Heroku Button to easily deploy your own copy of the app.

Building Models

One of our criteria when evaluating serialization frameworks was a solution that did not require generated code. Avro met that criterion, but the downside of having no generated code in this case is that you’re just working with hashes in Ruby. Pass a hash in for Avro to encode, and get a hash back out after decoding. Working directly with lots of hashes is error prone and leads to terrible-looking code.

In addition to providing better validation, using model objects allows us to attach additional behavior to those objects.

Enter Avromatic

To satisfy our desire for models based on our Avro schemas, we wrote avromatic. This gem allows you to specify a schema and dynamically generate a model. This gives us models that we can use in our applications while still avoiding statically generated code:
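The general technique of building a model class from a schema at runtime can be sketched with plain Ruby metaprogramming. This is a toy, not avromatic's implementation: toy_model and the schema below are hypothetical, and validation is reduced to checking that non-optional fields are present.

```ruby
# Hypothetical helper: builds a class with one attribute per schema field.
def toy_model(schema)
  field_names = schema.fetch('fields').map { |f| f.fetch('name').to_sym }
  required = schema['fields']
             .reject { |f| f['type'].is_a?(Array) && f['type'].include?('null') }
             .map { |f| f['name'].to_sym }

  Class.new do
    attr_accessor(*field_names)

    define_method(:initialize) do |attrs = {}|
      unknown = attrs.keys - field_names
      raise ArgumentError, "unknown attributes: #{unknown.join(', ')}" unless unknown.empty?
      attrs.each { |name, value| public_send("#{name}=", value) }
    end

    # Valid when every field that is not a union with null has a value.
    define_method(:valid?) { required.none? { |name| public_send(name).nil? } }

    define_method(:to_h) { field_names.to_h { |name| [name, public_send(name)] } }
  end
end

USER_SCHEMA = {
  'type' => 'record',
  'name' => 'user',
  'fields' => [
    { 'name' => 'id', 'type' => 'long' },
    { 'name' => 'nickname', 'type' => %w[null string], 'default' => nil }
  ]
}.freeze

ToyUser = toy_model(USER_SCHEMA)
puts ToyUser.new(id: 1).valid? # → true
puts ToyUser.new.valid?        # → false
```

A misspelled attribute raises immediately instead of silently travelling along in a hash, which is much of the appeal of models over hashes.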

Avromatic can also be used to generate a module that can be included in a class to add the attributes for the schema allowing us to still leverage inheritance:

It is even possible to go directly from the avro-builder DSL to a model.

Schema Store

In the examples above where the Avro schema is referenced by name, the Avro JSON schema is being loaded from the filesystem using a schema store (see AvroTurf::SchemaStore). The schema store loads the previously generated .avsc artifacts. As mentioned above, we prefer to use the generated JSON schema files at runtime and this also makes avromatic independent of the avro-builder DSL. The schema store is configured with a filesystem path and loads schema files that exist under that path based on their full name, including namespace. The schema store returns in-memory schema objects (Avro::Schema) and caches them by their full name.
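A simplified picture of what such a store does is below. ToySchemaStore is a hypothetical stand-in that returns parsed JSON rather than Avro::Schema objects, and the mapping of a full name like com.example.user to com/example/user.avsc is our assumption about the file layout.

```ruby
require 'json'
require 'tmpdir'
require 'fileutils'

# Stand-in schema store: loads .avsc files by full name and caches them.
class ToySchemaStore
  def initialize(path)
    @path = path
    @cache = {}
  end

  # 'com.example.user' is looked up at <path>/com/example/user.avsc.
  def find(full_name)
    @cache[full_name] ||= begin
      file = File.join(@path, *full_name.split('.')) + '.avsc'
      JSON.parse(File.read(file))
    end
  end
end

# Writes one schema file to a temp directory and loads it twice.
def demo_schema_store
  Dir.mktmpdir do |dir|
    FileUtils.mkdir_p(File.join(dir, 'com', 'example'))
    schema = { 'type' => 'record', 'name' => 'user',
               'namespace' => 'com.example', 'fields' => [] }
    File.write(File.join(dir, 'com', 'example', 'user.avsc'), JSON.generate(schema))

    store = ToySchemaStore.new(dir)
    first = store.find('com.example.user')
    second = store.find('com.example.user')
    [first['name'], first.equal?(second)]
  end
end

puts demo_schema_store.inspect # → ["user", true]
```

The second lookup returns the very same object, so the file is read and parsed only once.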

Model Serialization

A big chunk of the functionality of the generated models is their ability to encode and decode themselves using Avro. For this avromatic uses the AvroTurf::Messaging API. This API integrates with a schema registry, such as avro-schema-registry, to prefix Avro-encoded values with an id for the schema, and in reverse decode values based on a schema id prefix. This allows messages to be passed around with the schema id embedded in the message, removing the need to pass around the full JSON schema.

In addition to a value, messages published to Kafka have an optional key field that is used to determine the partition for a message. To support this, avromatic-generated models allow both a key schema and a value schema to be specified. The attributes of the generated model are then the union of the fields from the two schemas, provided there are no conflicting definitions.
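The union of key and value fields can be sketched as a merge that fails on conflicts. Here a conflict means the same field name with a different type, which is our assumption rather than avromatic's exact rule; merged_attributes is a hypothetical helper.

```ruby
# Merges key and value fields by name, raising when a name is reused
# with a different type.
def merged_attributes(key_fields, value_fields)
  (key_fields + value_fields).each_with_object({}) do |field, attrs|
    name = field.fetch('name')
    existing = attrs[name]
    if existing && existing['type'] != field['type']
      raise ArgumentError, "conflicting definitions for field '#{name}'"
    end
    attrs[name] = field
  end
end

key_fields = [{ 'name' => 'id', 'type' => 'long' }]
value_fields = [{ 'name' => 'id', 'type' => 'long' },
                { 'name' => 'name', 'type' => 'string' }]

puts merged_attributes(key_fields, value_fields).keys.inspect # → ["id", "name"]
```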

When avromatic is configured with:

a schema store (for loading Avro JSON schema files from local disk)

a schema_registry (for exchanging an Avro JSON schema for a generated id)

an AvroTurf::Messaging object (for Avro encoding data and prefixing with the schema id)

then generated models support avro_message_key and avro_message_value methods for generating the message fields with Avro-encoded values and embedded schema ids:

In the opposite direction, generated models have an avro_message_decode method that accepts the schema id prefixed, Avro-encoded messaging fields (value and optional key) to return a model instance: MyModel.avro_message_decode(message_key, message_value).

To handle Kafka topics that have messages corresponding to multiple different models, avromatic also provides an Avromatic::Model::MessageDecoder class that can be configured with a list of generated models and then decodes each message as the correct model.
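The dispatch idea can be sketched as a lookup from the schema id prefix to a model class. ToyMessageDecoder is hypothetical and says nothing about how Avromatic::Model::MessageDecoder is built. It assumes a prefix of one zero magic byte followed by a 4-byte big-endian schema id, and a plain hash stands in for the registry lookup.

```ruby
# Picks a model class for a message based on its schema id prefix.
class ToyMessageDecoder
  # models_by_schema_name: { 'com.example.user' => SomeClass }
  # schema_names_by_id:    stands in for a schema registry lookup
  def initialize(models_by_schema_name, schema_names_by_id)
    @models = models_by_schema_name
    @schema_names_by_id = schema_names_by_id
  end

  def model_for(message_value)
    magic, schema_id = message_value.unpack('CN')
    raise ArgumentError, 'unknown magic byte' unless magic.zero?
    @models.fetch(@schema_names_by_id.fetch(schema_id))
  end
end

ToyUserEvent = Struct.new(:id)
ToyOrderEvent = Struct.new(:total)

DECODER = ToyMessageDecoder.new(
  { 'com.example.user' => ToyUserEvent, 'com.example.order' => ToyOrderEvent },
  { 1 => 'com.example.user', 2 => 'com.example.order' }
)

puts DECODER.model_for([0, 2].pack('CN') + "\x04".b) # → ToyOrderEvent
```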

Models generated using avromatic also support encoding directly to Avro without requiring a schema registry or the Messaging API. All of this is detailed in the avromatic repository.

The Destination

We're happy with where our adventures with Avro have led us. We have a developer-friendly solution for defining schemas that scales across multiple projects. We have a service for storing all the versions of each of our schemas so that we can always decode a message. And we have models that provide a natural integration between our schemas and our application code. What I've described above more than meets our needs today, but I'm sure these projects will continue to evolve. We are excited to share them with the community and we hope you find them useful too.