Data Eng Weekly

Data Eng Weekly Issue #277

12 August 2018

Quite a bit of variety in this week's issue, including Kafka on Kubernetes, Docker on YARN, speeding up data parsing by filtering raw data, Hadoop at Microsoft, and the NSA's LemonGraph open source project. Also, a couple of new books to check out and releases of Flink, Landoop Lenses, and a KSQL plugin for VSCode.

Sponsor

From the creators of Apache Arrow, Dremio is an open source Data-as-a-Service platform. Accelerate your queries (up to 1,000x!) and make data truly self-service for your BI and data science users.

Technical

Hortonworks published a four-part series this week on Docker on YARN. It describes a financial derivatives application built on a C++ library, whose dependencies are easily fulfilled via a Docker image. The series walks through several details about the architecture, deployment, and more.

Cloudera's solution for Kafka schema management doesn't require a separate service, which makes its usage and configuration a bit different from (and somewhat simpler than) other solutions. This post walks through how to set up a consumer.

The Hadoop YARN capacity scheduler now features application lifetime SLAs. This post shows how to configure the settings, both as an admin and at job submission time, if you want an application or queue to have a maximum lifetime.
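The lifetime limits are configured per queue in capacity-scheduler.xml. A minimal sketch, assuming a hypothetical queue named "etl" (values are in seconds; -1 means no limit):

```xml
<!-- capacity-scheduler.xml: lifetime limits for a hypothetical "etl" queue -->
<property>
  <!-- Hard cap: any application in this queue is killed after one hour -->
  <name>yarn.scheduler.capacity.root.etl.maximum-application-lifetime</name>
  <value>3600</value>
</property>
<property>
  <!-- Default applied when the application specifies no lifetime at submission -->
  <name>yarn.scheduler.capacity.root.etl.default-application-lifetime</name>
  <value>1800</value>
</property>
```

At submission time an application can request a shorter lifetime for itself, but it cannot exceed the queue's maximum.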

BigQuery has a new clustering feature that is similar to partitioning in many data systems. When clustering is enabled, queries that include the clustered columns in a filter/where clause can see significant speedups by scanning much less data. This post shows several examples of the speedups and cost savings from enabling clustering.
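In BigQuery's standard SQL DDL, clustering is declared alongside partitioning at table creation time (at launch, clustering requires a partitioned table). A sketch with hypothetical table and column names:

```sql
-- Hypothetical events table, partitioned by day and clustered on two columns
CREATE TABLE mydataset.events (
  event_ts    TIMESTAMP,
  customer_id STRING,
  event_type  STRING,
  payload     STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id, event_type;

-- Queries that filter on the clustered columns scan far fewer bytes,
-- because BigQuery can skip blocks that cannot contain matching rows:
SELECT COUNT(*)
FROM mydataset.events
WHERE DATE(event_ts) = '2018-08-01'
  AND customer_id = 'c-123';
```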

Confluent has published Helm Charts and a white paper for running the Confluent Platform on Kubernetes. The deployment makes use of StatefulSets and Persistent Volumes for Kafka and ZooKeeper, while using Deployments for stateless services. The Helm Charts are on GitHub, and the white paper is behind an email wall.

Sparser is a new research project to speed up queries over large data sets by filtering raw bytes before parsing. As described in this post, it uses SIMD-based filters and an optimizer that arranges/selects candidate filters to build an efficient pre-parsing filter. It works on both JSON and binary formats (like Parquet) to speed up end-to-end processing time in Spark by up to 9x. The code is on GitHub, and hopefully we'll see these techniques integrated into some common projects soon.
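The core idea can be illustrated with a toy Python sketch: run a cheap byte-level substring test over each raw record and only parse the survivors. This captures the spirit of the technique, not Sparser itself, whose filters are SIMD-based and chosen by an optimizer; the function and field names below are invented for illustration.

```python
import json

def filter_then_parse(raw_lines, needle):
    """Toy pre-parsing filter: discard records with a cheap raw-byte
    substring test before paying the cost of full JSON parsing."""
    results = []
    for line in raw_lines:
        # Cheap pre-filter on raw bytes: most non-matching records are
        # rejected here without ever being parsed.
        if needle not in line:
            continue
        record = json.loads(line)
        # Re-check the exact predicate on the parsed record, since the
        # raw substring test can produce false positives.
        if record.get("user") == needle.decode():
            results.append(record)
    return results

lines = [
    b'{"user": "alice", "action": "click"}',
    b'{"user": "bob", "action": "view"}',
    b'{"user": "alice", "action": "buy"}',
]
matches = filter_then_parse(lines, b"alice")
```

Since the substring test can match bytes anywhere in the record (e.g. inside another field), the exact predicate still has to run on the parsed survivors; the win comes from how few records reach that stage.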

This interview with Honeycomb co-founder and CEO Charity Majors covers a lot of ground in the observability space (often considered the successor to monitoring and alerting). She talks about Facebook's internal debugging tool "Scuba" and the systems that they've built at Honeycomb, and she provides a bunch of great advice about the state of the art and best practices in debuggability and observability.

If you have a Spark cluster publicly accessible on the internet, there's a remote code execution exploit that you should be aware of. It uses the Spark REST API to download and execute a rogue program, and Alibaba has detected it in the wild.

Microsoft writes about its 50,000+ node Apache Hadoop YARN cluster, which is used by over 15,000 developers performing applied research and data science tasks. It operates at exabyte scale and is believed to be the largest known YARN cluster in the world. The article describes several of the performance issues and optimizations that the team has implemented.

GitHub has open sourced its GitHub Load Balancer (GLB) Director, which is a key part of the GLB architecture. This post describes the technical details of how GLB operates. While it's not a typical data system, there are a number of interesting distributed systems challenges and solutions discussed (consistent hashing, routing, testing, failover, and more).
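One of the building blocks discussed, consistent hashing, can be sketched in a few lines of Python using the rendezvous (highest-random-weight) variant: each backend is scored per connection key, and the top-ranked entries become the primary and backup targets. This is an illustrative toy, not GLB's actual implementation, and the server names are invented.

```python
import hashlib

def rendezvous_pick(key, servers, k=2):
    """Rank servers for `key` by hash(key, server) and return the top k.
    Removing a server only remaps the keys that ranked it first, which is
    the property that makes draining and failover cheap."""
    def score(server):
        digest = hashlib.sha256(f"{key}:{server}".encode()).hexdigest()
        return int(digest, 16)
    return sorted(servers, key=score, reverse=True)[:k]

servers = ["proxy-a", "proxy-b", "proxy-c", "proxy-d"]
primary, backup = rendezvous_pick("198.51.100.7:443", servers)
```

The useful property: if the backup (or any lower-ranked server) disappears, the same key still hashes to the same primary, so most in-flight connections are undisturbed.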

LemonGraph is a recently open sourced graph (nodes/edges) database project from the National Security Agency. This blog series tours the codebase, which is Python (wrapping a bunch of native C code). It looks at the storage component, query parsing, execution engine, and more.

"Getting Started with Kudu" is a new book that covers the architecture, use cases, and design patterns for Apache Kudu. One of the co-authors provides a bit more detail about the book and has some pointers for getting in touch with the authors via the Kudu Slack channel.

With new elastic data warehouse technologies like Redshift and BigQuery, it's become pretty common to load raw data before doing any transformations. This post describes several reasons why this technique—ELT—has been replacing ETL.

"Kafka Streams in Action" is an upcoming book on building applications with the Kafka Streams API. The Confluent blog has the foreword, which is written by Confluent co-founder and CTO Neha Narkhede. The first chapter of the book is also available for free on the Manning website.

Releases

Apache Flink 1.6.0 was released. It has a number of features that improve stateful stream processing, such as support for state TTL, job submission over HTTP/REST, Apache Avro support for streaming SQL, UDF and batch query support for the SQL client, and a Kafka table sink. Flink also now has Jepsen testing for fault tolerance.

Landoop has released Lenses 2.1, which includes a number of updates to the Lenses SQL streaming engine. The new version adds support for custom serialization formats, bundled support for XML and protobuf, and support for arrays, along with several UI features to visualize streaming programs.