When sending data over the network or storing it in a file, we need a
way to encode the data into bytes. The field of data serialization has
a long history, but has evolved quite a bit over the last few
years. Developers started with language-specific serialization, such
as Java serialization, which makes consuming the data in other
languages inconvenient. They then moved to language-agnostic formats
such as JSON. However, formats like JSON lack a strictly defined
schema, which has two drawbacks. First, they are verbose, because
field names and type information have to be explicitly represented in
the serialized data. Second, the lack of structure makes consuming
data in these formats more challenging, because fields can be
arbitrarily added or removed.

More recently, a few cross-language serialization libraries have
emerged that require the data structure to be formally defined by some
sort of schema. These libraries include Thrift, Protocol Buffers, and
Avro. The advantage of having a schema is that it clearly specifies
the structure, the type, and the meaning (through documentation) of
the data. With a schema, data can also be encoded more
efficiently. Among those schema-aware serialization libraries, we
particularly recommend Avro because its authors have put more thought
into how to evolve schemas. Next, we briefly introduce Avro and how it
can be used to support typical schema evolution.

An Avro schema defines the data structure in a JSON format. The
following is an example Avro schema that specifies a user record with
two fields: name and favorite_number of type string and int,
respectively.
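A minimal sketch of such a schema, matching the description above,
might look like this:

```json
{
  "type": "record",
  "name": "user",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": "int"}
  ]
}
```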

One of the interesting things about Avro is that it requires a schema
not only during data serialization but also during data
deserialization. Because the schema is provided at decoding time,
metadata such as the field names doesn't have to be explicitly encoded
in the data. This makes the binary encoding of Avro data very compact.
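To make the compactness concrete, here is a small self-contained
sketch (not the Avro library itself) of how Avro's binary encoding
rules apply to the user record above; `encode_user` is a hypothetical
helper written for illustration:

```python
# Sketch of Avro's binary encoding for the user record: only the field
# values are written, in schema order. Ints are zigzag-encoded varints;
# strings are a varint length followed by UTF-8 bytes. No field names
# or types appear in the output.

def zigzag_varint(n: int) -> bytes:
    """Encode an integer using Avro's zigzag + variable-length encoding."""
    n = (n << 1) ^ (n >> 63)          # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # set continuation bit, more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_user(name: str, favorite_number: int) -> bytes:
    """Encode a {name, favorite_number} record per the schema field order."""
    data = name.encode("utf-8")
    return zigzag_varint(len(data)) + data + zigzag_varint(favorite_number)

encoded = encode_user("Alice", 42)
print(len(encoded))  # 7 bytes; the equivalent JSON object is several times larger
```

The entire record fits in 7 bytes because the schema, available again
at decoding time, supplies the field names and types.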

An important aspect of data management is schema evolution. After the
initial schema is defined, applications may need to evolve it over
time. When this happens, it's critical for downstream consumers to be
able to handle data encoded with both the old and the new schema
seamlessly. This area tends to be overlooked, and teams that don't
think it through carefully often pay the cost later on.

There are three common patterns of schema evolution: backward
compatibility, forward compatibility, and full compatibility.

First, consider the case where all the data is loaded into HDFS and we
want to run SQL queries (e.g., using Apache Hive) over all the
data. To support this kind of use case, we can evolve the schemas in a
backward compatible way: data encoded with the old schema can be read
with the newer schema. Avro has a set of rules on
what changes are allowed in the new schema for it to be backward
compatible. If all schemas are evolved in a backward compatible way,
we can always use the latest schema to query all the data
uniformly. For example, an application can evolve the previous user
schema to the following by adding a new field favorite_color.
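A sketch of the evolved schema, based on the change described above,
might be:

```json
{
  "type": "record",
  "name": "user",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": "int"},
    {"name": "favorite_color", "type": "string", "default": "green"}
  ]
}
```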

Note that the new field has a default value “green”. This allows data
encoded with the old schema to be read with the new one. The default
value specified in the new schema will be used for the missing field
when deserializing the data encoded with the old schema. Had the
default value been omitted from the new field, the new schema would
not be backward compatible with the old one, since it's not clear what
value should be assigned to the new field, which is missing in the old
data.
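The resolution step can be sketched as follows. This is an
illustration of the idea, not the Avro library's API: schemas are
shown as plain dicts, and `resolve` is a hypothetical helper that
projects a decoded record onto the reader's schema.

```python
# Sketch of Avro-style schema resolution when reading old data with a
# newer schema: fields absent from the writer's data take the reader
# schema's default; a missing field with no default is an error.

new_schema = {
    "type": "record",
    "name": "user",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": "int"},
        {"name": "favorite_color", "type": "string", "default": "green"},
    ],
}

def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader's schema."""
    result = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            result[field["name"]] = record[field["name"]]
        elif "default" in field:
            result[field["name"]] = field["default"]  # fill from default
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return result

old_record = {"name": "Alice", "favorite_number": 42}  # written with old schema
print(resolve(old_record, new_schema))
# {'name': 'Alice', 'favorite_number': 42, 'favorite_color': 'green'}
```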

Second, consider another use case where a consumer has application
logic tied to a particular version of the schema. When the schema
evolves, the application logic may not be updated
immediately. Therefore, we need to be able to project data with newer
schemas onto the schema that the application understands. To support
this use case, we can evolve the schemas in a forward compatible way:
data encoded with the new schema can be read with the old schema. For
example, the above new user schema is also forward compatible with the
old one. When projecting data written with the new schema to the old
one, the new field is simply dropped. Had the new schema dropped the
field favorite_color, it would not be forward compatible with the
original user schema since we wouldn’t know how to fill in the value
for favorite_color for the new data.
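The forward-compatible projection can be sketched the same way. Again,
schemas are plain dicts for illustration, and `project` is a
hypothetical helper, not the Avro library's API:

```python
# Sketch of forward-compatible projection: data written with the newer
# schema is read with the old schema, and fields the reader doesn't
# know about are simply dropped.

old_schema = {
    "type": "record",
    "name": "user",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": "int"},
    ],
}

def project(record: dict, reader_schema: dict) -> dict:
    """Keep only the fields the reader's schema knows about."""
    wanted = {f["name"] for f in reader_schema["fields"]}
    return {k: v for k, v in record.items() if k in wanted}

new_record = {"name": "Alice", "favorite_number": 42, "favorite_color": "blue"}
print(project(new_record, old_schema))
# {'name': 'Alice', 'favorite_number': 42}
```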

Finally, to support both of the previous use cases on the same data,
we can evolve the schemas in a fully compatible way: old data can be
read with the new schema, and new data can also be read with the old
schema.

As you can see, when using Avro, one of the most important tasks is to
manage its schemas and reason about how those schemas should
evolve. Confluent's Schema Registry is built for exactly that
purpose. You can find the details on how we use it to store Avro
schemas and enforce certain compatibility rules during schema
evolution in the Schema Registry API Reference.