Data Eng Weekly

Data Eng Weekly Issue #269

17 June 2018

Companies have shared lots of great posts this week: Pandora's web UI for Kafka, metadata management at Netflix, GraphQL at Airbnb, robust data pipelines at Dataxu, and fronting Kafka at GO-JEK. There's also coverage of the new YARN scheduler for long-running applications, a high-performance single-server stream processing engine, and a recap of the recent Spark + AI Summit.

Sponsor

Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2rHK6iw, or visit dremio.com to learn more.

Technical

Airbnb has written about their experiences implementing GraphQL as an API gateway atop Apache Thrift services. The post has a good mix of technical (their architecture, including Thrift/GraphQL translators) and non-technical (how to frame the conversation and seek compromise) topics.

Originally in Chinese, this post analyzes a recent exploit of unsecured Apache Hadoop YARN clusters that was used for cryptocurrency mining. It also outlines how to secure a cluster with publicly accessible endpoints.

Amazon DynamoDB has a change data capture feature called DynamoDB Streams, which integrates easily with AWS Lambda for real-time processing. This article explains how to use these features to compute real-time aggregates. There's a good discussion of how to tune the system for correctness, error handling, and throughput.
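
As a rough illustration of the idea (not the article's implementation), the sketch below folds a batch of DynamoDB Streams records into per-key deltas that a Lambda function could then apply to an aggregate table. The attribute names `category` and `amount` are hypothetical, the stream is assumed to deliver both old and new images, and the write-back step is left out.

```python
from collections import defaultdict
from decimal import Decimal


def _amount(image):
    """Read the hypothetical numeric `amount` attribute from a stream image."""
    return Decimal(image["amount"]["N"])


def _category(image):
    """Read the hypothetical string `category` attribute from a stream image."""
    return image["category"]["S"]


def compute_deltas(event):
    """Turn a batch of stream records into net changes per category.

    Subtracting the old image and adding the new one handles inserts,
    updates, and deletes uniformly (assumes the stream view includes
    both old and new images).
    """
    deltas = defaultdict(Decimal)
    for record in event.get("Records", []):
        change = record.get("dynamodb", {})
        old, new = change.get("OldImage"), change.get("NewImage")
        if old:
            deltas[_category(old)] -= _amount(old)
        if new:
            deltas[_category(new)] += _amount(new)
    return dict(deltas)
```

Because a Lambda invocation can be retried with the same batch, applying these deltas would also need some form of idempotence to stay correct, which is the kind of tuning the article discusses.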

It can be a challenge to share large research and government data sets (think atmospheric or satellite data). To make this type of data accessible, this post proposes that organizations "Place your Big Data in cloud object storage in a self-describing, cloud-optimized format." It goes into more detail about the challenges (and some solutions) unique to adopting that practice for these types of data.

Dataxu shares their solution to data synchronization—handing off data from one step in the pipeline to the next. Rather than relying on file system paths, they have a centralized "file feed" protocol that provides a number of benefits.

This post compares SABER, a single-server stream processing engine, to Apache Flink and Apache Spark. With modest hardware (20 cores, 32 GB RAM), SABER outperforms a 5-node cluster of each. In some ways, this post is reminiscent of the "CLI tools are 235x faster than Hadoop" thread from a few years back.

Qubole has a post about their new query optimizer feature that estimates the total amount of memory needed for a Presto query. There are details on the design and correctness results from the TPC-DS benchmark.

Many organizations design microservices so that each uses its own data store, avoiding the drawbacks of a multitenant database system. This post describes how using Kafka as an event store offers an interesting alternative architecture.

The Morning Paper has coverage of the Medea scheduler, which implements scheduling for long-running applications atop Apache Hadoop YARN. Medea offers constraints like anti-affinity (to keep HBase region servers on separate nodes), global optimizations, and more. The authors compare it to other schedulers like Hadoop YARN's previous scheduler and a Java version of the Kubernetes scheduling algorithm. Medea is in use at Microsoft and is part of the Apache Hadoop 3.1.0 release (YARN-6592).

The GO-JEK team uses a fronting REST service for ingesting data into Kafka. That service in turn writes data to a fronting Kafka cluster, or it fails over to Redis if Kafka is down. This post explains more about the motivation and architecture.
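
The failover behavior described above could be sketched roughly as follows. This is a minimal illustration, not GO-JEK's code: `FailoverWriter`, `primary`, and `fallback` are hypothetical names, and the two writers are plain callables standing in for a Kafka producer and a Redis client.

```python
class FailoverWriter:
    """Write to a primary sink, falling back to a secondary sink on failure."""

    def __init__(self, primary, fallback):
        # Both arguments are callables taking (topic, payload); in a real
        # service they would wrap a Kafka producer and a Redis client.
        self.primary = primary
        self.fallback = fallback

    def write(self, topic, payload):
        """Return which sink accepted the message: "primary" or "fallback"."""
        try:
            self.primary(topic, payload)
            return "primary"
        except Exception:
            # Assumption: any error from the primary means it is unavailable.
            self.fallback(topic, payload)
            return "fallback"
```

A real service would also need a way to drain the fallback store back into Kafka once the cluster recovers, which this sketch leaves out.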

The Apache Hadoop YARN Service Framework makes it quite easy to deploy a long-lived application to Hadoop via a single Yarnfile definition. The Hortonworks blog has a brief overview of what it takes to migrate Apache Hive LLAP from Apache Slider to use the YARN Service Framework.

This post introduces Metacat, Netflix's tool for data discovery, programmatic dataset metadata access, and more. It is a proxy to other backends (such as the Hive metastore), and it provides advanced features via an Elasticsearch index. Metacat is open sourced on GitHub.

This list of distributed systems papers has been updated with new content from the past four years. If you're interested in learning the fundamentals of distributed systems theory, it's a great place to start.

Releases

Pandora has open sourced KBrowse, a web UI and search tool for Apache Kafka. This post walks through how they use KBrowse at Pandora to debug issues with new content.