Data Eng Weekly

Hadoop Weekly Issue #219

04 June 2017

Lots of great technical posts this week, including several on using Amazon S3 cloud storage and building data systems with Apache Kafka. There's also a post on the Luigi workflow engine, writing Cucumber tests for Spark, and several news/release posts.

Technical

The Pivotal blog has a tutorial (with sample code) for building a machine learning pipeline using the Luigi workflow engine. The machine learning functionality in the example comes from Apache MADlib (incubating), which is invoked via PL/pgSQL from the Luigi tasks.
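The core Luigi idea in the tutorial is that each task declares its dependencies and a completion check, and the scheduler runs whatever is missing. The sketch below is a toy, stdlib-only version of that pattern (the class and method names mirror Luigi's API, but this is not Luigi itself, and the "training" step is a stand-in for the MADlib call):

```python
# A minimal, self-contained sketch of the Luigi task pattern: each task
# declares dependencies (requires), a completion check (complete), and
# its work (run). A tiny depth-first scheduler runs what's missing.

class Task:
    def requires(self):
        return []
    def complete(self):
        return False
    def run(self):
        pass

def build(task):
    """Run a task's dependencies first, then the task itself if needed."""
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

results = {}  # stand-in for task outputs (Luigi uses Targets, e.g. files)

class LoadData(Task):
    def complete(self):
        return "raw" in results
    def run(self):
        results["raw"] = [1.0, 2.0, 3.0]

class TrainModel(Task):
    def requires(self):
        return [LoadData()]
    def complete(self):
        return "model" in results
    def run(self):
        # In the tutorial this step calls MADlib via PL/pgSQL; here we
        # just compute a stand-in "model" (the mean of the input).
        raw = results["raw"]
        results["model"] = sum(raw) / len(raw)

build(TrainModel())
```

Because `complete()` is checked before `run()`, re-running `build` is idempotent, which is the property that makes Luigi pipelines safely restartable.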

This post describes how to use Apache Apex to write data from Apache Kafka to Apache Kudu. In addition to the basics, the post covers how to implement exactly-once semantics, how to handle partial (single column) updates, and some of the operational metrics that are captured as part of the process.
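The two mechanisms the post covers can be illustrated without any Apex or Kudu code. In this stdlib sketch, a dict stands in for a Kudu table: exactly-once delivery is approximated by remembering the highest offset applied per partition (so replays are skipped), and a partial update only touches the columns present in the message:

```python
# Not Apex code: a toy illustration of exactly-once-by-offset-tracking
# and partial (per-column) updates, with a dict standing in for Kudu.

table = {}    # primary key -> row (a dict of columns)
applied = {}  # partition -> highest offset already applied

def apply_message(partition, offset, key, columns):
    """Apply a Kafka-style message; return False if it's a replay."""
    if offset <= applied.get(partition, -1):
        return False  # duplicate delivery: already applied, skip it
    row = table.setdefault(key, {})
    row.update(columns)  # partial update: only listed columns change
    applied[partition] = offset
    return True

apply_message(0, 0, "user1", {"name": "Ada", "score": 10})
apply_message(0, 1, "user1", {"score": 11})   # single-column update
apply_message(0, 1, "user1", {"score": 99})   # replay: ignored
```

After these calls, `table["user1"]` is `{"name": "Ada", "score": 11}`: the replay at offset 1 was rejected, and the partial update left the `name` column intact. Real systems persist the applied offsets atomically with the data so the check survives restarts.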

The Databricks blog has two posts on cloud storage. In the first, they describe a number of advantages of using Amazon S3 over HDFS when in AWS—including 5x cost savings and higher availability and durability. The second post is on transactional writes to Amazon S3, the absence of which has often been a drawback. While other cloud offerings have had this functionality for some time (e.g. Amazon EMR introduced EMR-FS in 2014), Databricks is adding a new feature that improves on the support available in vanilla Hadoop's S3 implementation.
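The essence of transactional output committing is write-then-publish: data files land in a staging area, and readers only treat a job's output as real once a commit marker exists. The sketch below illustrates that pattern on the local filesystem using the Hadoop-style `_SUCCESS` marker convention; it is not Databricks' implementation, and real object stores need extra care because they lack atomic rename:

```python
# A stdlib sketch of the write-then-commit pattern: stage all files,
# move them into place, and only then drop a _SUCCESS marker. A reader
# that checks the marker sees either the whole job's output or none.

import os
import tempfile

def write_job_output(outdir, files):
    staging = os.path.join(outdir, "_staging")
    os.makedirs(staging)
    for name, data in files.items():
        with open(os.path.join(staging, name), "w") as f:
            f.write(data)
    # Publish: move data files first, write the marker last.
    for name in files:
        os.rename(os.path.join(staging, name),
                  os.path.join(outdir, name))
    with open(os.path.join(outdir, "_SUCCESS"), "w"):
        pass  # empty marker file signals a committed job

def committed_files(outdir):
    """A reader that ignores uncommitted (marker-less) output."""
    if not os.path.exists(os.path.join(outdir, "_SUCCESS")):
        return []
    return sorted(n for n in os.listdir(outdir)
                  if not n.startswith("_"))

out = tempfile.mkdtemp()
write_job_output(out, {"part-0": "a,b\n", "part-1": "c,d\n"})
```

On S3 the rename step is not atomic (it's a copy plus delete), which is exactly why committers like EMR-FS and the Databricks feature exist: they move the commit decision into a single atomic operation such as writing one marker or metadata record.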

The MapR blog has an in-depth overview of performance tuning on a real-life application that involves Apache Kafka, Spark Streaming, and Apache Ignite (for caching of RDDs). Improvements include increasing the number of Kafka partitions, fixing an RPC timeout setting, tuning memory of both Spark and Ignite, and modifying the batch interval.

This post makes the case that "Apache Kafka is more disruptive than merely being faster ETL." It highlights several advantages that Kafka brings, including integration between streaming/applications/databases, distributing ETL (rather than a centralized monolith), and scale & reliability.

Datanami has a post describing Pandora's Kafka deployment, which uses Kafka Connect to write Apache Parquet files to a Hadoop cluster. They are making use of the Kafka Schema Registry, and they've written a custom Gradle plugin for migrations. As the post highlights, they've had a positive overall experience despite some issues (e.g. when HDFS is unavailable, the HDFS Sink Connector can corrupt its WAL).

As a nice follow-up to the Databricks posts on S3 vs. HDFS, this post describes some of the main features and options of S3DistCp for copying data from HDFS to S3. Based on Hadoop's DistCp, S3DistCp is optimized for S3 and offers features like changing the file compression.

News

The Confluent Log Compaction post includes a preview of what to expect in Kafka 0.11.0.0. New features include exactly-once semantics, a new admin API, and improvements to Kafka Connect and operations.

Apache SystemML, which is a machine learning library that's built for scaling out on Apache Spark and Apache Hadoop, has become a top-level project. The press release includes an overview and quotes from some companies that are using it.

Apache Hadoop CVE-2017-7669 is a privilege escalation in the docker feature that was added to Apache Hadoop 2.8.0 (and other alpha releases). There isn't yet a fix for the 2.8.x line, so mitigation is to disable Docker support.

Apache Flink 1.3.0 was released with several major areas of improvement. Specifically, there are improvements to recovery (and state handling), the DataStream API, the Table API (SQL), and deployment and tooling (including watermark monitoring in the web front-end). More details about the new features can be found in the release announcement.