Data Eng Weekly

Data Eng Weekly Issue #290

18 November 2018

Several different technologies covered this week—Kafka, Pulsar, Spark, Druid, Airflow, and HDFS. In open-source news, Etsy announced an Airflow companion tool, Edmunds announced two tools for working with Databricks deployments, and Pravega announced a ZooKeeper Operator for Kubernetes. There's also an interesting look at how WeChat operates at massive scale. On a scheduling note, I'll be skipping next week's issue—so look for issue #291 on December 2nd. Please send articles you find my way in the meantime!

Sponsor

Interested in extracting value from large quantities of data? That's what we do at Criteo and it's actually pretty much all we do. We have R&D job openings in data engineering, machine learning, and related areas. We use a large number of Big Data technologies, mainly open source, and like to contribute to the open-source community as much as possible. We are looking for people who are interested in tackling the challenges that come with ingesting and analyzing hundreds of terabytes of data per day. Never heard of Criteo? We are an international ad-tech company with offices all over the world. We have a pretty cool corporate culture: http://bit.ly/criteo-culture. We take Big Data seriously enough that we have our own conference, NABD http://bit.ly/NABD-Conf, which is guaranteed to be free of marketing talks.

Technical

Edmunds writes about how they've scaled their use of Databricks notebooks. They've implemented, and open sourced, several tools to automate checks (e.g. to validate job names), establish common software patterns, and implement a deployment pipeline. The open-source tools are a Java REST client for Databricks and a Maven plugin.
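The post doesn't spell out Edmunds' exact naming rules, but the idea of an automated job-name check can be sketched in a few lines. The convention below (`<team>-<env>-<purpose>`) is purely an assumption for illustration:

```python
import re

# Hypothetical naming convention: "<team>-<env>-<purpose>", e.g. "seo-prod-ingest".
# Edmunds' actual rules aren't described in detail; this regex is illustrative.
JOB_NAME_PATTERN = re.compile(r"^[a-z]+-(dev|qa|prod)-[a-z0-9_]+$")

def validate_job_name(name: str) -> bool:
    """Return True if the job name follows the assumed convention."""
    return bool(JOB_NAME_PATTERN.match(name))

print(validate_job_name("seo-prod-ingest"))   # True
print(validate_job_name("AdHocNotebook42"))   # False
```

A check like this can run in a deployment pipeline and fail the build before a misnamed job ever reaches Databricks.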

Streamlio has examples of implementing several approximation algorithms via Apache Pulsar functions. Their post covers Bloom Filters, Count-Min Sketch, HyperLogLog, and more. There are good visuals accompanying each of the algorithms.
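To give a flavor of what these algorithms trade away, here's a toy Bloom filter in Python (not Streamlio's Pulsar-function code, just the core idea): membership tests in constant space, at the cost of occasional false positives.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: fast set-membership tests with a small false-positive rate."""
    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item: str):
        # Derive num_hashes positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user-123")
print(bf.might_contain("user-123"))  # True
```

Count-Min Sketch and HyperLogLog make analogous trade-offs for frequency counts and cardinality estimates, respectively.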

The slides and accompanying source code from "Build Event-Driven Microservices with Apache Kafka," a three-hour workshop at QCon SF, have been posted. Three labs cover building a web service for an online orders application.

The Confluent blog has a post explaining Kafka Connect's Converters, which are used to serialize and deserialize data. It describes common errors and problems related to Apache Avro and JSON (two common formats), how to troubleshoot problems by looking at logs and inspecting data directly in Kafka, and how to work with other data formats.
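Converters are configured at the worker level (and can be overridden per connector). A classic failure mode the post covers is reading Avro-serialized data with the JsonConverter, or setting `schemas.enable` inconsistently with the data. A typical worker configuration looks something like this (hostnames are placeholders):

```properties
# Avro with Confluent Schema Registry:
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081

# Alternatively, plain JSON without embedded schemas
# (schemas.enable must match how the data was actually written):
# value.converter=org.apache.kafka.connect.json.JsonConverter
# value.converter.schemas.enable=false
```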

Airbnb writes about their experiences with Druid for analytics. They describe how Druid complements their other big data systems, how they ingest data with Spark Streaming, their integration with Presto, monitoring, and challenges and planned improvements.

Etsy has open sourced their tool, called boundary-layer, for defining Apache Airflow workflows using YAML. One of the reasons for the popularity of Python for workflow engines (e.g. Luigi, Airflow) is that there is inevitably scripting involved in running and deploying a workflow. But not everyone knows Python, and there are some other reasons why a declarative format might make sense (a compelling one here is that boundary-layer can convert Oozie Workflows to its YAML format). This post has more details on their reasoning and how the tool is used at Etsy.
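To make the declarative idea concrete, a boundary-layer workflow is roughly a YAML document naming the DAG, its arguments, and its operators. The snippet below is illustrative only; consult the boundary-layer README for the actual schema:

```yaml
# Illustrative sketch -- field names may differ from boundary-layer's real schema.
name: example_dag
dag_args:
  schedule_interval: '@daily'
operators:
  - name: say_hello
    type: bash_operator
    properties:
      bash_command: "echo hello"
```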

Criteo (disclosure: they are sponsoring this week's issue!) has written about the challenges of scaling their Hadoop cluster. There are a number of interesting anecdotes from debugging NameNode performance issues and outages. Criteo's HDFS cluster adds 40PB of data each year, and its NameNode stores hundreds of millions of blocks.
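Numbers like that matter because the NameNode keeps the entire namespace in heap. Using the common rule of thumb of roughly 150 bytes of heap per namespace object (file or block), a back-of-the-envelope estimate looks like this (the counts below are illustrative assumptions, not Criteo's actual figures):

```python
# Rough NameNode heap sizing via the ~150-bytes-per-object rule of thumb.
BYTES_PER_OBJECT = 150
blocks = 300_000_000   # "hundreds of millions of blocks"
files = 200_000_000    # assumed file count

heap_bytes = (blocks + files) * BYTES_PER_OBJECT
print(f"~{heap_bytes / 1e9:.0f} GB of heap")  # ~75 GB
```

At that scale, garbage-collection pauses and lock contention in the NameNode become exactly the kind of operational problem the post describes.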

WeChat recently published a paper about how they scale their microservices architecture (which is made up of over 3,000 services). The morning paper has a great summary of the paper and the overload control mechanism, which is essential for handling large-scale spikes in traffic.
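One signal the paper highlights is queuing time: when requests sit in queues too long, the service sheds low-priority work. A minimal sketch of that idea (loosely inspired by the paper, not WeChat's implementation, with an arbitrary threshold) might look like:

```python
from collections import deque

class OverloadController:
    """Shed low-priority requests when recent average queuing delay is high.
    A toy sketch of queuing-time-based overload detection."""
    def __init__(self, threshold_ms: float = 20.0, window: int = 100):
        self.threshold_ms = threshold_ms
        self.delays = deque(maxlen=window)  # sliding window of observed delays

    def record_queuing_delay(self, delay_ms: float) -> None:
        self.delays.append(delay_ms)

    def overloaded(self) -> bool:
        if not self.delays:
            return False
        return sum(self.delays) / len(self.delays) > self.threshold_ms

    def admit(self, priority: int) -> bool:
        # Lower number = higher priority; always admit the top priority.
        return priority == 0 or not self.overloaded()

ctl = OverloadController()
for _ in range(50):
    ctl.record_queuing_delay(40.0)  # simulate heavy queuing
print(ctl.admit(priority=0))  # True  (high priority admitted)
print(ctl.admit(priority=5))  # False (low priority shed under overload)
```

The real system's admission levels adapt dynamically and propagate upstream so that callers stop sending doomed requests; the paper's summary is well worth a read.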

Amazon has announced that they're maintaining a distribution of OpenJDK, called Amazon Corretto. The OpenJDK 8-based version is available in preview now (with GA targeted for Q1 2019), and an OpenJDK 11-based version is planned for April of next year.
