Data Eng Weekly

Hadoop Weekly Issue #221

25 June 2017

This issue covers more articles on proprietary tools than usual, but there are really interesting things to note from Google, Qubole, and Amazon Web Services. The new Hortonworks Streaming Analytics Manager looks neat, and there are a couple of articles on Kafka. Finally, the slides and videos from two recent conferences—DataWorks and Berlin Buzzwords—have been posted.

Technical

The Google Cloud Big Data blog has a post on common use cases for Cloud Dataflow. While some of the content is Google Cloud-specific, the patterns and the pseudo-code presented are largely general-purpose—and it's interesting to see how Cloud Dataflow solves various problems.

The Qubole blog describes how they keep the Hive Metastore's statistics up to date. They've implemented a custom MetastoreEventListener to detect new tables, partitions, etc., and kick off Hive ANALYZE commands to compute statistics. There are a few improvements over the naive solution—particularly throttling and batching of commands that involve multiple partitions.
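The batching-and-throttling idea can be sketched in Python; the event shape and function names below are hypothetical stand-ins (the actual listener is Java code inside the Metastore, and isn't shown in this form in the post):

```python
from collections import defaultdict

def batch_analyze_commands(events, batch_size=10):
    """Turn partition-added events into Hive ANALYZE statements,
    grouped per table and handed out in fixed-size batches so the
    executor can throttle instead of firing one command per event.

    `events` is a list of (table, partition_spec) tuples, e.g.
    ("sales", "dt='2017-06-25'") -- a hypothetical shape; the real
    listener receives Metastore event objects.
    """
    by_table = defaultdict(list)
    for table, partition in events:
        by_table[table].append(partition)

    commands = [
        f"ANALYZE TABLE {table} PARTITION ({part}) COMPUTE STATISTICS"
        for table, parts in sorted(by_table.items())
        for part in parts
    ]
    # Throttle: return batches for the executor to run one at a time.
    return [commands[i:i + batch_size]
            for i in range(0, len(commands), batch_size)]
```

A scheduler would then drain these batches at a controlled rate rather than issuing an ANALYZE the instant each partition lands.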

Hortonworks has an in-depth look at the new Streaming Analytics Manager. The post describes the main components—service pools and environments—and explains how to build an application using the Stream Builder canvas. There's integration with the Hortonworks Schema Registry to automatically detect the schema from a Kafka topic and built-in support for common streaming processors like joins, projections, and aggregations.

Qubole has a jump start on a lot of other vendors when it comes to big data as a service. This post covers one of their cloud-specific differentiators—Container Packing. Based on the YARN fair scheduler, container packing is a mechanism to improve the scale-down capabilities of an auto-scaling cluster to ultimately keep costs down. The post describes the high-level algorithm for container packing, and how to enable it in Qubole.
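A minimal sketch of the packing heuristic, assuming a simple node map; the real logic lives inside Qubole's modified YARN scheduler, so everything here (names and data shapes) is illustrative:

```python
def pick_node(nodes, demand):
    """Choose a node for a new container by packing onto the busiest
    node that still has room, so lightly loaded nodes drain empty and
    become eligible for scale-down.

    `nodes` maps node name -> (used, capacity), a hypothetical shape.
    Returns the chosen node name, or None if nothing fits (in which
    case an auto-scaling cluster would scale up instead).
    """
    candidates = [
        (used, name)
        for name, (used, capacity) in nodes.items()
        if capacity - used >= demand
    ]
    if not candidates:
        return None
    # Highest current usage first = tightest packing.
    candidates.sort(reverse=True)
    return candidates[0][1]
```

Compare with a spread-style scheduler, which would pick the *least* loaded node and keep every node just busy enough to prevent scale-down.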

The Cloudera blog has an overview of various strategies for managing offsets when running Apache Spark Streaming jobs based on data in Apache Kafka. The post includes code for saving and loading offsets to Apache HBase, Apache ZooKeeper, and Kafka.
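The save/load pattern common to all three backends can be sketched with an in-memory stand-in; the class and method names below are hypothetical, not Spark, HBase, or ZooKeeper APIs:

```python
class OffsetStore:
    """Minimal stand-in for an external offset store (HBase table,
    ZooKeeper znode, or Kafka's own offsets topic). Values are the
    next offset to read for each (topic, partition)."""

    def __init__(self):
        self._store = {}

    def save(self, group, offsets):
        # Persist offsets only AFTER the batch's output is committed,
        # so a crash replays the batch (at-least-once) rather than
        # silently skipping it.
        for (topic, partition), offset in offsets.items():
            self._store[(group, topic, partition)] = offset

    def load(self, group, topic, partitions):
        # Partitions seen for the first time start from offset 0
        # (a real job might start from earliest/latest instead).
        return {
            (topic, p): self._store.get((group, topic, p), 0)
            for p in partitions
        }
```

On startup, a streaming job loads the saved offsets and passes them to its Kafka consumer as the starting positions; after each batch succeeds, it saves the new high-water marks.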

This post provides an overview of how to use Kafka for streaming ETL. The tutorial uses Kafka Connect for extracting data from a relational database (including a simple transformation), running a Kafka Streams application, and then loading the data into another database (once again using Kafka Connect). The post has lots of code (which tends to be mostly configuration) and an overview of what each of these pieces is doing.
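For a flavor of the extract side, here is a sketch of a Kafka Connect JDBC source configuration with a simple in-flight transformation, written as a Python dict. The connector class and InsertField transform are standard Kafka Connect pieces; the connector name, JDBC URL, and table are placeholders, not values from the post:

```python
import json

# Hypothetical names throughout; the configuration KEYS are standard
# Kafka Connect / JDBC source connector settings.
jdbc_source = {
    "name": "orders-source",  # placeholder connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db:5432/shop",  # placeholder
        "table.whitelist": "orders",                         # placeholder
        # Poll for new rows by a monotonically increasing column:
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "jdbc-",
        # Simple single-message transform applied as rows flow through:
        "transforms": "ts",
        "transforms.ts.type":
            "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.ts.timestamp.field": "ingested_at",
    },
}

# This dict would be POSTed as JSON to the Connect REST API
# to create the connector.
payload = json.dumps(jdbc_source)
```

The sink side looks much the same, swapping in a sink connector class and pointing `topics` at the stream the Kafka Streams application writes to.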

In this post, SparkR is used to parallelize a Markov Chain Monte Carlo calculation to improve runtime from 48 hours on a single machine to 45 minutes on a 50-node Spark cluster. The post has a few good tips and highlights some gotchas related to SparkR.
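The parallelization pattern (independent chains per seed, combined at the end) can be sketched in plain Python. The trivial Monte Carlo kernel below is a stand-in for the post's actual MCMC computation, and the local list comprehension stands in for the distributed map:

```python
import random

def run_chain(seed, n_samples=1000):
    """One independent chain: estimate E[X] for X ~ Uniform(0, 1),
    seeded so each partition's work is reproducible. A stand-in for
    a real MCMC kernel, which is not reproduced here."""
    rng = random.Random(seed)
    total = sum(rng.random() for _ in range(n_samples))
    return total / n_samples

def parallel_estimate(n_chains=8):
    # Locally this is a plain map; on a cluster the same per-seed
    # function is what gets shipped to workers (e.g. via SparkR's
    # spark.lapply, or sc.parallelize(seeds).map(run_chain) in
    # PySpark), with only the small per-chain results collected back.
    results = [run_chain(seed) for seed in range(n_chains)]
    return sum(results) / len(results)
```

Because the chains share no state, the speedup is close to linear in the number of executors, which is what makes the 48-hours-to-45-minutes improvement plausible.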

The AWS Big Data blog has a post on best practices for Amazon Redshift Spectrum (using Redshift to query data in S3). Among the recommendations are suggestions for when to use Spectrum vs. Athena and how to allocate data between Redshift local storage and S3.

Confluent's Log Compaction has coverage of the upcoming Kafka 0.11.0 release, which will have exactly-once semantics via an idempotent producer (among other things). The post also has links to a number of great Kafka-related blogs and presentations.
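For reference, enabling the idempotent producer in the 0.11 Java client comes down to one setting; shown here as a Python dict of the Java producer's property names (the broker address is a placeholder):

```python
# Java producer properties for Kafka 0.11's idempotent producer,
# written as a Python dict for readability. With idempotence on,
# broker-side deduplication means retries cannot introduce
# duplicates within a partition.
producer_props = {
    "bootstrap.servers": "broker1:9092",  # placeholder
    "enable.idempotence": "true",         # new in 0.11.0
    "acks": "all",                        # required when idempotence is on
}
```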