When entering the world of Apache Kafka, Apache Spark and data streams, sooner or later you will find mentioning of another Apache project; namely Apache AVRO. So ….

What is Apache Avro ?

Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project (source: wikipedia). It is much the same as Apache Thrift and Google Protocol Buffers. Probably the main reason for Avro to gain in popularity is due to the fact that Hadoop-based big data platforms natively support serialization and deserialization of data in Avro format. Avro is based upon JSON based schemas and messages can be sent in both JSON and binary format. If binary is used then the schema is sent together with the actual data,

Playing with Apache Avro from the command line

So first let’s create a Avro schema “location.avsc” for the data records

Generate data record from JSON to Avro

The result is an output “location.avro” with the Avro binary. The interesting about Avro is that is encapsulates both the schema and the content in it’s binary message.

Objavro.schema#####ype":"record","name":"Location","namespace":"nl.rubix.avro","fields":[{"name":"vehicle_id","type":"string","doc":"id of the vehicle"},{"name":"timestamp","type":"long","doc":"time in seconds"},{"name":"latitude","type":"double"},{"name":"longtitude","type":"double"}],"doc:":"A schema for vehicle movement events"}avro.codenull~##5############.1м##