Data Eng Weekly

Data Eng Weekly Issue #275

29 July 2018

Lots of great content this week, from high-level architectural patterns for microservices, data platforms, and distributed systems, to system internals (e.g. Kafka authorization and Qubole's spot instance framework), to lessons learned building out data platforms in adtech and for a museum. There's also news and announcements about several conferences, releases, and new projects.

Sponsor

Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2t96fa7 or visit http://dremio.com to learn more.

Technical

This post goes through examples of each of the eight fallacies of distributed computing and provides some common architectural patterns for addressing each.
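
To make one of those pattern pairings concrete, here's a minimal retry-with-backoff sketch for fallacy #1 ("the network is reliable"). The function names and parameters are illustrative, not from the post:

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.1):
    """Retry a flaky remote call with exponential backoff and jitter.

    Transient failures are expected over a network, so the caller
    retries a bounded number of times instead of assuming the first
    attempt succeeds.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff plus jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Example: a call that fails twice before succeeding.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

result = call_with_retries(flaky)
```

The bounded attempt count matters as much as the backoff: unbounded retries turn a transient failure into a self-inflicted outage.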

Debezium, the change data capture system, uses the database's own log files (e.g. the transaction or bin log). This has a number of advantages over polling the database for changes—five of those advantages are highlighted in this post.
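For context, a Debezium change event carries the row's state before and after the change plus an operation code. Here's a sketch of unpacking such an envelope — the payload field names (`op`, `before`, `after`, `ts_ms`) follow Debezium's documented event format, but the event itself is simplified and the surrounding code is illustrative:

```python
import json

# A simplified Debezium-style change event for an UPDATE ("op": "u").
event = json.loads("""
{
  "payload": {
    "op": "u",
    "before": {"id": 42, "email": "old@example.com"},
    "after":  {"id": 42, "email": "new@example.com"},
    "ts_ms": 1532822400000
  }
}
""")

def describe_change(evt):
    """Map Debezium's op codes to the kind of row change."""
    ops = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}
    payload = evt["payload"]
    return ops[payload["op"]], payload["before"], payload["after"]

kind, before, after = describe_change(event)
```

Because the events come from the database's own log rather than polling, deletes and intermediate states are captured too — a polling consumer would only ever see the latest row.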

Here's a description of the bug in Kafka's authorization layer underlying a recently disclosed vulnerability. There's also a look at the patching/disclosure process and what other work has been done to verify the project's authorization implementations.

As more and more systems support common formats and data stored in cloud object stores, it opens up opportunities to reduce the number of copies of data you need to keep. This post provides some useful terminology for talking about the flavors of data architecture and their tradeoffs.

A look at the data architecture and data engineering at Unruly, a company in the adtech space. Unruly uses both AWS and Google BigQuery, with Apache Airflow driving workflows. The post covers the evolution of their platform, such as their move from Bash to Python, and the possibility of replacing BigQuery with Athena to simplify the architecture.

Qubole writes about the components they've implemented to optimize performance in the face of spot instance termination in AWS. They've added functionality to respond to spot termination events to decommission the instance housing HDFS DataNodes (to ensure no new blocks written) and proactively cancel Spark executors rather than waiting for them to timeout. These features, coupled with their other modifications to work with spot instances, improve performance and reliability while keeping costs low.
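AWS signals a pending spot interruption via the instance metadata service (a GET to `http://169.254.169.254/latest/meta-data/spot/instance-action` returns 404 normally and a small JSON body once termination is scheduled). A sketch of the decision logic Qubole's approach implies, with the HTTP fetch stubbed out — the function and its wiring are illustrative, not Qubole's code:

```python
import json

# On a real instance you would poll the spot/instance-action metadata
# endpoint; here the response body is passed in directly.
def handle_termination_notice(body):
    """Given the metadata response body (or None for a 404), decide
    whether to start decommissioning: stop placing new HDFS blocks on
    this node and proactively cancel its Spark executors."""
    if body is None:
        return False, None
    notice = json.loads(body)
    if notice.get("action") in ("terminate", "stop"):
        return True, notice["time"]
    return False, None

# No notice yet:
decommission, when = handle_termination_notice(None)
# Termination scheduled:
decommission2, when2 = handle_termination_notice(
    '{"action": "terminate", "time": "2018-07-29T12:00:00Z"}'
)
```

The two-minute warning window is short, which is why reacting to the notice (rather than waiting for executors to time out) pays off.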

This article is about the data platform of a museum, which collects a number of data points through an interactive pen that visitors use. The data is processed using Logstash and Elasticsearch, and they've built a data warehouse using Amazon Redshift. It's an interesting look at a very different application from the typical internet-company data platform.

Cloudera has the second part in their series on implementing Avro data serialization atop Apache Kafka. This post looks at building a simple in-memory store and the first parts of a schema registry that stores its schemas in a Kafka topic.
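The idea of a registry backed by a Kafka topic can be sketched as a dict rebuilt by replaying the topic's key/value records at startup. This is an illustrative simplification (the topic is simulated as a plain list), not Cloudera's code:

```python
class InMemorySchemaStore:
    """Schemas keyed by id; the backing Kafka topic is simulated as a
    list of (schema_id, schema_json) records replayed at startup."""

    def __init__(self, topic_records):
        self._schemas = {}
        # Replaying the topic in order: the last record per id wins,
        # mirroring a log-compacted Kafka topic.
        for schema_id, schema_json in topic_records:
            self._schemas[schema_id] = schema_json

    def get(self, schema_id):
        return self._schemas.get(schema_id)

    def register(self, schema_id, schema_json):
        # In the real design this would also be produced to the topic
        # so other registry instances can replay it.
        self._schemas[schema_id] = schema_json

# Replaying two records, the second superseding the first for id 1.
topic = [
    (1, '{"type": "string"}'),
    (1, '{"type": "record", "name": "User", "fields": []}'),
]
store = InMemorySchemaStore(topic)
```

Storing the schemas in a Kafka topic means the registry needs no separate database: any instance can rebuild the full store by replaying the log.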

A great overview of microservices communication patterns and their trade-offs. These include: service-service communication, API Gateway, client-side aggregation, and event bus. There's also some advice for how to get off the ground with a new project.
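Of those patterns, the event bus is the easiest to sketch in-process; a minimal illustrative version (a real deployment would put a broker such as Kafka between the services):

```python
from collections import defaultdict

class EventBus:
    """Minimal publish/subscribe bus: services register handlers per
    topic, and publishers never reference subscribers directly."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

# Two services react to the same event without knowing each other.
bus = EventBus()
emails, audit = [], []
bus.subscribe("order.created", lambda e: emails.append(e["id"]))
bus.subscribe("order.created", lambda e: audit.append(e["id"]))
bus.publish("order.created", {"id": 7})
```

The trade-off relative to direct service-to-service calls is the usual one: looser coupling and easy fan-out, at the cost of weaker delivery guarantees and harder end-to-end tracing.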

Releases

Astronomer has released version 0.3.0 of their managed Apache Airflow product. The release runs on Kubernetes to support the major clouds and on-prem deployments, improves high availability, and includes improvements to the command line utility.

Sponsors

Unravel released a step-by-step video http://bit.ly/unravel-video on how to make Kafka and Spark streaming applications fast and reliable. Watch it now.