Data Encodings and Layout

Clemens Vasters

Like many other messaging products and services, the services I build with
my team at Microsoft mostly take a neutral stance towards payload data. We
move byte arrays and streams. In fact, in my team we made it a hard principle
to never touch the message payload inside our services. The upside of that
stance is that we can easily support end-to-end payload encryption, because
we don’t attempt to make any decisions based on the content.

But: Applications need to make hard choices about how they encode their data,
and therefore I’m sharing some in-depth guidance that I initially wrote for an
early draft of the Azure IoT Reference Architecture, but which didn’t make it
into the final doc in its entirety due to concerns about the overall size and depth
of the document. The guidance applies quite broadly to messaging and not only
to IoT. This is the “Director’s Cut”:

Introduction

There is a large and growing number of formats available for the encoding
of structured data for communication purposes, and the optimal data encoding
choice will differ from use-case to use-case and is sometimes even constrained
by factors like the available code library footprint on a device.

JSON and XML (yes, still) are ubiquitous on the server and many clients and
enjoy very broad library or platform-inherent support, but incur a very
significant wire footprint.

CSV is simple, interoperable, and compact (for
text), but it’s structurally constrained to rows of simple value columns –
which is very often enough for time-series data.

BSON, CBOR, and MessagePack are efficient binary encodings that lean on the JSON model
and have great encoding size advantages, but require their own libraries and
bring about some idiosyncratic choices like no first-class array support in the
case of BSON.

Protobuf and Apache Thrift yield very small encoding sizes, but require
distribution of an external schema (or even code) to all potential consumers,
which is a prohibitive requirement in systems of nontrivial composition
complexity.

Apache Avro is generally as efficient as or more efficient than these prior
options, natively supports layered-on compression, and can carry the required
schema as a preamble, though using the preamble puts Avro at a disadvantage compared to
MessagePack, CBOR, or BSON for small or highly structured payloads with minimal
structural repetition.

This list is not complete, but reflects the most popular options from what I
can see.

Just as important as the encoding is the data layout, which can also have major
impact on encoding size. A naïve JSON encoding approach where telemetry data is
sent in the form of an array of objects where each object carries explicit
properties for all values has enormously greater metadata overhead than a data
layout mimicking CSV with a shared list of headers followed by an array of arrays
carrying the row data.
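
To make that difference concrete, here is a small sketch using Python’s standard
json module; the telemetry field names and values are made up for illustration:

    import json

    # Hypothetical telemetry rows; field names are illustrative only.
    rows = [
        {"deviceId": "dev-01", "timestamp": "2017-06-01T10:00:00Z", "temperature": 21.3},
        {"deviceId": "dev-01", "timestamp": "2017-06-01T10:00:10Z", "temperature": 21.4},
        {"deviceId": "dev-01", "timestamp": "2017-06-01T10:00:20Z", "temperature": 21.6},
    ]

    # Naive layout: an array of objects, repeating every property name per record.
    naive = json.dumps(rows)

    # CSV-like layout: a shared header list followed by an array of arrays with the row data.
    compact = json.dumps({
        "columns": ["deviceId", "timestamp", "temperature"],
        "rows": [[r["deviceId"], r["timestamp"], r["temperature"]] for r in rows],
    })

    # The metadata overhead of the naive layout grows with every additional row.
    print(len(naive), len(compact))

With only three rows the difference is modest; with thousands of rows per
message, the repeated property names come to dominate the payload.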

Data Structure Considerations

The most common data encodings cover three great groups in terms of the approach
to data structuring:

Comma-Separated-Values (CSV, including all other kinds of separators), and
practically all other tabular data representations lay out data records as
rows with the record data being split into columns. The column definitions
uniformly apply to all rows, but rows may be “sparse”, meaning that a
particular row may only carry values for a subset of the columns. The
rows/columns structure is very suitable for time-series information.

XML and HTML use a structural model based on a nominally unbounded tree
structure made up of nodes, whereby “elements” can be annotated with
qualifying attributes and may contain other nodes, including plaintext
content. This structural model is most suitable for carrying and describing
distinct sets of complex content that is to be processed by generic
infrastructure components, such as a browser rendering a web page.

JSON and many other encodings, of which several will be explicitly discussed
further on, use a structural model that is very closely aligned with the
structural models used in the most popular programming languages and
frameworks. Values are either held in (one-dimensional) arrays or in maps,
whereby maps are dictionaries with uniquely keyed entries holding values.
Values are of primitive types or are arrays or maps. Multidimensional arrays
are modeled as arrays of arrays. The map/array/value structural model is
the most universal, and I generally recommend it since the rows/columns and
elements/attributes models can be expressed on top of it, while the reverse
is not true.

For most application-to-application scenarios, the map/array/value structural
model is generally preferable, and that ultimately also explains the success
JSON had against XML, in spite of XML initially being the staunchly defended
darling of the standards establishment.

Structural Metadata

There are several models for how data is described with metadata, in terms of
data types and item identifiers, providing the system information about the
particular layout and allowing data items to be identified and appropriately
encoded and decoded.

The most common models are:

External schema – With external schema models, the description of data types
and structure is shared or distributed separately from the communication
path over which the data is exchanged or from where the data is stored. Data
encodings like Apache Thrift or Google Protocol Buffers (“protobuf”) use
this approach with the goal of reducing the data volume transferred over the
communication path or stored on media.

Schema preamble – Schema preamble models separate the schema information
from the data, but carry the schema with the data at all times. CSV commonly
uses a descriptive schema preamble in the form of a header line that provides
identifiers for the columns, and the column data types can commonly be
inferred from the data itself. Apache Avro has a formal schema language,
allowing for description of complex structures, and a copy of the schema is,
as a matter of principle, always carried as a preamble with any Avro data
container.

Tagged data – In JSON and many other encodings, data is tagged, with each
data item individually carrying the identifier (where needed) and the data
type.

While the encodings using external schema, like Protobuf and Thrift, do achieve
the goal of reduced footprint, the imposed cost on a complex system is enormous,
as the external schema must be distributed and synchronized throughout all
system components that need to process the information. A common approach to
this problem is to hold the schema information in shared registries.

Information that is durably stored and somehow gets separated from the external
metadata is effectively rendered unusable through that separation. It’ll be a
moment of intense grief when you’re in a highly regulated engineering field
like automotive or aerospace, open a raw certification telemetry data
archive from cold storage in 10 years for an accident investigation, and
someone forgot to keep the associated schema service going.

I therefore strongly discourage using any data encoding requiring external
schema for durable storage. Furthermore, I do not recommend using any
data encoding requiring external schema for any scenario where the two
communicating endpoints are not under common control or where it is not
practical to near simultaneously upgrade/change the schema at both ends of the
communication path, even if the data encoding supports additive changes.

The marginal efficiency advantages Thrift or Protobuf may have over encodings
like MessagePack or Avro in certain scenarios will always be severely
overshadowed by the burden of external schema management, which becomes
ever more complex as systems evolve.

Encoding Formats

In the following, I discuss several data encodings with usage scenarios,
whereby all either use the schema preamble or tagging models. That means I am
excluding Protobuf and Thrift a priori because of the external schema concerns
laid out above.

The descriptions are brief, and readers are encouraged to study the linked
specifications or overviews.

JSON – JavaScript Object Notation

JSON (IETF
RFC8259) is a lightweight, text-based
data interchange format providing a map/array/value structural model that is
derived from a subset of JavaScript (ECMAScript). JSON is quite easy to parse
and trivial to generate and therefore ubiquitously available or easily
implementable anywhere.

JSON is a good default choice for all structured data, at rest and in motion,
as it is the most interoperable option with the broadest reach. Practically
speaking, a solution might end up never using JSON given the format options listed
below, but JSON ought always to be a supported option on all processing and
communication paths.

As JSON is text-based, it has efficiency disadvantages compared to binary
formats, and those will often be preferable in scenarios where storage or
communication path footprint or encoding effort are of concern. JSON is,
however, always the most interoperable option. Because it is text, it’s also a
more robust long-term archival choice. You’ll surely be able to read JSON in 30
years; that’s somewhat less assured with binary formats that are largely
defined by a particular implementation.

Unless otherwise specified in the communication transport frame, all JSON text
is assumed to use the UTF-8 (IETF RFC3629)
text encoding.
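
As a minimal sketch of producing interoperable JSON text for the wire, using
Python’s standard json module (the record content is made up):

    import json

    record = {"deviceId": "dev-01", "temperature": 21.3, "unit": "°C"}

    # Keep non-ASCII characters literal and encode the JSON text as UTF-8,
    # matching the RFC3629 assumption stated above.
    wire_bytes = json.dumps(record, ensure_ascii=False).encode("utf-8")

    # The round trip restores the original map/array/value structure.
    assert json.loads(wire_bytes.decode("utf-8")) == record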

CSV – Comma Separated Values

CSV (IETF
RFC4180) is a very broadly used and very
simple convention for encoding tabular data made up of rows and columns. RFC4180
is an attempt at standardizing the convention, but “comma-separated” data
factually occurs in a broad variety of forms with semicolons, the vertical bar
(pipe) symbol, tabs, and other characters used as separators and with data
occurring quoted or unquoted.

The advantage of CSV is that it allows for a quite compact text encoding of
tabular data with a schema preamble in the header followed by data with little
overhead except for separators, and thus a much more compact rendering than a
(naïve) JSON encoding that uses an array of records with repetitive metadata per
record. The downside of CSV is, as mentioned, the lack of a reliable standard or
convention and thus the absence of a type model. In lieu of that, I am suggesting
the following constraints:

As an extension to RFC4180, UTF-8 character data is used for text encoding
(IETF RFC3629).

As a constraint to RFC4180, all CSV files and streams MUST have a header
line with the column names. Column names may occur quoted or unquoted and
must comply with the JSON rules for constructing strings
(RFC7159, Section 7).

As an extension to RFC4180, JSON type inference is used for column data
during decoding.

All data enclosed in surrounding single quotes (‘) or double quotes (“) is
treated as string data, whereby the quotes are removed and do not count
towards the string data.

All column data that is a valid numeric JSON expression (RFC7159, Section 6)
is treated as a number, with the subtype further inferred from the expression.

All column data that is a valid JSON null, true, or false value
(RFC7159, Section 3) is interpreted as a Boolean or null value, as applicable.

An empty column (no data or only unquoted whitespace data) is Null.

All other column data is treated as string.

With the above rules applied, CSV is preferred over JSON for encoding of tabular
data where all columns carry data of primitive types, when a minimum of two rows
is commonly expected.
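
The following is a minimal decoding sketch for these constraints, using only
Python’s standard csv and json modules; the function names are mine, and quote
handling is deliberately simplified (separators inside quoted fields are not
handled):

    import csv
    import io
    import json

    def infer(cell):
        """Apply the JSON-style type inference rules listed above to one cell."""
        stripped = cell.strip()
        if stripped == "":
            return None                      # empty column -> null
        if len(stripped) >= 2 and stripped[0] == stripped[-1] and stripped[0] in ("'", '"'):
            return stripped[1:-1]            # quoted data -> string, quotes removed
        if stripped in ("null", "true", "false"):
            return json.loads(stripped)      # JSON literals -> None / True / False
        try:
            value = json.loads(stripped)
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return value                 # valid JSON numeric expression -> number
        except ValueError:
            pass
        return cell                          # everything else -> string

    def decode_csv(text):
        # Quote characters are passed through to infer() so the string rule can see them.
        reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONE)
        header = next(reader)                # the header line is mandatory per the rules
        return [dict(zip(header, (infer(cell) for cell in row))) for row in reader]

    rows = decode_csv("deviceId,timestamp,temperature\ndev-01,2017-06-01T10:00:00Z,21.3\n")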

Apache Avro

Apache Avro
(Specification) is a data
serialization system developed in the Apache Foundation, which features a very
compact (and fairly straightforward) binary data encoding format, a formal
schema language, and implementations across a number of languages and platforms,
including Java, C#, C/C++, and Node.js, which are most relevant in server-side
processing.

While Avro requires a schema for the encoder and decoder logic to function, it
defines a container model where the JSON-encoded formal schema can be carried
as a preamble for the encoded data. A suitable schema for encoding into Avro can
be dynamically inferred from a given object graph, which means that a schema is
always available for decoding and a schema can always be synthesized for
encoding from any given concrete graph. That being so, Avro is a very suitable
alternative to JSON.

Avro yields an extremely compact data encoding that can be further improved by
data compression, which is also directly supported by the specification and the
library implementations.
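
As a sketch of what that looks like in practice, assuming the third-party
fastavro Python package (the record schema and field names here are made up):

    import io
    from fastavro import parse_schema, reader, writer

    # Hypothetical telemetry schema; Avro stores this JSON schema as a preamble
    # of the object container it writes, so readers need no external copy.
    schema = parse_schema({
        "type": "record",
        "name": "Telemetry",
        "fields": [
            {"name": "deviceId", "type": "string"},
            {"name": "timestamp", "type": "long"},
            {"name": "temperature", "type": "double"},
        ],
    })

    records = [
        {"deviceId": "dev-01", "timestamp": 1496311200 + i * 10, "temperature": 21.3}
        for i in range(1000)
    ]

    buf = io.BytesIO()
    writer(buf, schema, records, codec="deflate")   # compression is part of the container format

    buf.seek(0)
    decoded = list(reader(buf))                     # the schema travels with the data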

Because of the schema preamble being carried as plaintext JSON, the Avro
encoding can only start playing out its strength once the data encoding savings
eclipse the size of the schema preamble when compared to a JSON encoding or an
encoding in one of the alternate binary formats explained below.

I recommend Apache Avro as the preferred binary encoding for transferred
time-series data and all other structured data with significant structural
repetition, obviously only if an Avro implementation is available for the
devices in question. I also recommend Apache Avro as the preferred service-side
media storage format for structured data due to its compactness and native
support for data compression.

Apache Avro use should, however, be carefully considered in all cases where data
must be preserved in archives and outside the system context for extended
periods of time. Plain text formats take up more space, but the lack of
dependency on a particular binary format specification and implementation of
such a specification reduces the risk of the data not being decipherable decades
into the future.

AMQP Encoding

The AMQP 1.0 Protocol (ISO/IEC 19464:2014,
OASIS)
includes a compact binary type
encoding
providing a map/array/value structural model. The AMQP encoding is a tagged
format that is significantly more efficient than JSON and has a higher-fidelity
type system for numeric types and date-time expressions.

The advantage of AMQP 1.0 encoding is that the encoder is readily available as
part of any AMQP 1.0 client stack and therefore doesn’t require adding another
library to the overall client library footprint, which is often a concern in
embedded systems scenarios.

Generally, when AMQP 1.0 is used as a transport, AMQP type encoding is
technically superior to JSON. The downside of choosing AMQP encoding is that
the encoder/decoder is typically tied to the transport stack, meaning that it’s
a problematic choice when the encoding/decoding from/into object graphs doesn’t
happen right at the messaging API boundary or when the messaging API is
abstracted (like in JMS).

AMQP is a compact choice for single records and highly structured information
with minimal structural repetition, but is less efficient than Apache Avro for
data with highly repetitive structural elements like time-series data.

Except in pure AMQP scenarios that aim for maximum efficiency while using a
single encoding stack, it’s not a preferable choice for payload encoding due to
standalone AMQP encoders not being in widespread use.

MessagePack Encoding

MessagePack
(Specification) is a
very compact binary encoding providing a map/array/value structural model.
MessagePack is a schemaless, tagged format. It is more efficient than JSON and
the AMQP encoding.
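
A minimal sketch, assuming the msgpack Python package; the record content is
made up:

    import json
    import msgpack

    record = {"deviceId": "dev-01", "timestamp": 1496311200, "temperature": 21.3}

    packed = msgpack.packb(record)                   # tagged binary encoding, no schema required
    as_json = json.dumps(record).encode("utf-8")

    assert msgpack.unpackb(packed, raw=False) == record
    print(len(packed), len(as_json))                 # the MessagePack rendering is noticeably smaller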

Apache Avro’s schema-preamble strategy and native compression support still
yield significant advantages over MessagePack with highly repetitive
structural elements, but MessagePack is a great encoding choice for single
records and highly structured information with minimal structural repetition.

When Avro is not an option for structural reasons, whether a solution opts for
AMQP or MessagePack encoding depends on protocol use, library availability, and
library footprint considerations. AMQP encoding is essentially a very reasonable
and only slightly less efficient fallback option whenever MessagePack can’t be
used, or when AMQP’s ISO/IEC standardization matters for policy reasons.

Like AMQP, MessagePack encoding is not recommended for bulk data storage.

CBOR Encoding

The Concise Binary Object Representation (CBOR)
(Specification) is another compact
binary encoding providing a map/array/value structural model, blessed by IETF
in RFC7049. Like MessagePack, it’s a schemaless, tagged format and also has
quite a few implementations, even though not quite as many as MessagePack.

If MessagePack is a good choice for a scenario, CBOR will likely be a similarly
good choice and my guidance is equivalent.
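
A minimal sketch, assuming the cbor2 Python package; the usage mirrors the
MessagePack example above:

    import cbor2

    record = {"deviceId": "dev-01", "timestamp": 1496311200, "temperature": 21.3}

    encoded = cbor2.dumps(record)                    # tagged, schemaless binary encoding
    assert cbor2.loads(encoded) == record            # round-trips without external metadata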

Data Layout Convention for Map/Array/Value Encodings

While data encoding refers to how data is transformed to and from bits for
transfer and storage, a data layout convention describes how the structure of the
data is constrained so that data can be universally handled across a system.

The data layout convention presented here should be understood as a guidance
model informing schema definitions at the solution layer, with message and data
schemas following the layout conventions below, and adding concrete
semantics and data type choices for particular data items on top of the generic
foundation.

The most important principle around data layout is that the data unit handled
and processed in the context of this model is a record, and not a message, a
storage block, or a document. Each of these storage or messaging transfer units
may contain one or multiple data records (or events). A sequence of records
may span multiple messages or storage units. In the following, “message” will
refer both to messages and storage blocks or documents.

The row/column model of CSV provides a natural set of constraints for the layout:
it allows only for a list of rows without an explicit bound, each equating to a
record, with a set of columns, likewise without an explicit bound, whereby each
column value is of a primitive type.

For the map/array/value structural model supported by the JSON, Avro, AMQP, and
MessagePack data encodings, the following layout models are proposed:

Singular Record

A singular record occurs as a distinct object inside of a message. It may reside
at the root of the message, or it may be uniquely and unambiguously identifiable
through a reference expression. A single record is laid out as a map
(dictionary). The values of a single record may be of primitive types, arrays,
or objects.
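
For illustration, a singular record at the root of a JSON-encoded message could
look like this (the field names are made up):

    {
        "deviceId": "dev-01",
        "timestamp": "2017-06-01T10:00:00Z",
        "temperature": 21.3,
        "location": { "lat": 47.64, "lon": -122.13 },
        "alerts": [ "overTemp" ]
    }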

Record Sequence

A record sequence occurs as an array inside of a message. It may reside at
the root of the message, or it may be uniquely and unambiguously identifiable
through a reference expression. The records inside the array follow the above
rules for singular records. Records inside the array don’t need to be entirely
homogeneous, meaning they don’t need to have identical sets of properties, but
all properties that do overlap by identifier must be of the same type.

Property values may diverge in type across records, as permitted by JSON and
MessagePack; in the AMQP encoding the data is modeled as a list, and in Avro
as an array of union types.
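
For illustration, a record sequence at the root of a JSON-encoded message, with
records that are not entirely homogeneous but whose overlapping properties agree
in type (the field names are made up):

    [
        { "deviceId": "dev-01", "timestamp": "2017-06-01T10:00:00Z", "temperature": 21.3 },
        { "deviceId": "dev-01", "timestamp": "2017-06-01T10:00:10Z", "temperature": 21.4, "humidity": 0.52 },
        { "deviceId": "dev-01", "timestamp": "2017-06-01T10:00:20Z", "humidity": 0.53 }
    ]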

Encoding Decision Matrix

For deciding which of the presented encodings to choose, this matrix may help.
In the use-case column, “flat” refers to records that solely consist of
primitive data types. “Complex” refers to data where records contain nested
object structures. The Avro column assumes use of containers with inline schema.

Summary

As a summary, I’d like to suggest embracing the “Content-Type” declaration
available in protocols like HTTP, MQTT (5.0+), and AMQP, and choosing the
appropriate data layout and encoding per use-case. Make data encoding and
layout choices an explicit engineering decision, and don’t blindly pick one
format to rule them all. Also be deliberate about long-term storage choices,
and think hard about taking dependencies on formats requiring external
schema references outside of RPC scenarios.
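
As a small sketch of what declaring the content type looks like over HTTP,
using only Python’s standard library (the ingestion endpoint URL is
hypothetical):

    import json
    import urllib.request

    record = {"deviceId": "dev-01", "temperature": 21.3}

    # Declare the payload encoding explicitly instead of relying on out-of-band agreement.
    request = urllib.request.Request(
        "https://example.com/telemetry",             # hypothetical ingestion endpoint
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    # urllib.request.urlopen(request) would send the message with the declared encoding.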

PS: There has been some feedback that I should have included one or more of the
ASN.1 encodings,
in particular DER. Since
ASN.1 is a schema format, my concerns are substantially the same as for
Protobuf and Thrift: Use external schema with caution.