Data Eng Weekly

Hadoop Weekly Issue #159

28 February 2016

The theme of this week's newsletter is stream processing, from streaming SQL to streaming in Spark 2.0 to Spark Streaming with Amazon Kinesis to IBM's Quarks framework. On top of that, there's interesting content covering Airbnb's data infrastructure and Drillix, as well as lots of releases (Apache HBase, CaffeOnSpark from Yahoo, and more). And finally, as the month comes to a close, there are two CFPs ending in the next 48 hours.

Technical

The Apache Calcite project has been working to add support for streaming SQL, which could be integrated with systems like Apache Samza, Apache Flink, and Apache Storm. This presentation introduces some of the core concepts for streaming SQL, such as windowing and grouping, and gives several examples of streaming SQL queries (including a few which join against regular tabular datasets).
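
To make the windowing and grouping ideas concrete, here is a minimal plain-Python sketch of what a windowed streaming query with a join against a regular table computes. It is not Calcite's SQL syntax or API. The `ORDERS` events, the `PRODUCTS` lookup table, and the `tumbling_counts` helper are all hypothetical names made up for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import Counter

# Hypothetical static lookup table: product id -> category.
PRODUCTS = {1: "books", 2: "games"}

# Hypothetical order events: (event_time_seconds, product_id), in time order.
ORDERS = [(5, 1), (42, 2), (61, 1), (75, 1), (130, 2)]


def tumbling_counts(events, table, window_seconds=60):
    """Yield (window_start, {category: count}) once per non-empty window."""
    current_start = None
    counts = Counter()
    for event_time, product_id in events:
        window_start = event_time - event_time % window_seconds
        if current_start is not None and window_start != current_start:
            # The stream has moved past the open window, so emit it.
            yield current_start, dict(counts)
            counts = Counter()
        current_start = window_start
        counts[table[product_id]] += 1  # join against the static table
    if current_start is not None:
        yield current_start, dict(counts)  # flush the last window (finite input only)


results = list(tumbling_counts(ORDERS, PRODUCTS))
```

Here `results` holds one entry per 60-second window. A real stream never ends, so the final flush exists only because the sample input is finite.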

On the topic of real-time stream processing, this presentation describes the future plans for stream processing in Spark 2.0. The new version will introduce the notion of infinite DataFrames to provide a streaming API using the Spark SQL engine. The slides introduce the model, give examples of ETL and a streaming page view count, describe more about the solution, and give a timeline for Spark 2.0/2.1+ features.
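
The streaming page view count can be pictured as a query over a table that keeps growing, with the result updated as new rows arrive. The sketch below shows that idea in plain Python and does not use the Spark API. The `RunningPageViews` class and the sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter


class RunningPageViews:
    """Keeps a per-page count that is updated as each batch of rows arrives."""

    def __init__(self):
        self.counts = Counter()

    def append(self, batch):
        """Fold one batch of (user, page) rows into the running result."""
        self.counts.update(page for _user, page in batch)
        return dict(self.counts)


views = RunningPageViews()
first = views.append([("ann", "/home"), ("bob", "/docs")])
second = views.append([("ann", "/docs"), ("cy", "/docs")])
```

Each call to `append` returns the counts over all rows seen so far, so `second` reflects both batches, not only the latest one.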

Another post about the upcoming Spark 2.0 release covers the memory, code generation, and vectorization optimizations planned. The post recaps the motivation and describes in detail the main components of Tungsten, the initiative for implementing these improvements.

Another SQL presentation, this time on non-streaming SQL, describes how Apache Drill and Apache Phoenix are combined into a new system called Drillix. Drillix is powered by Apache Calcite (which is used by both Drill and Phoenix) and the recently announced Apache Arrow. The bulk of the presentation covers Drill, both the user-facing features and the underlying technology components, but the slides also give an introduction to Calcite, Drillix, and Arrow.

The primary goal of Apache Arrow is to make data interoperable across tools and languages. This post talks about how Arrow might help improve performance and interoperability of the Python framework pandas.

Airbnb has written a long post about the current state and evolution of their data infrastructure. It mentions the tools they've previously used (such as EMR, Mesos and EBS for HDFS storage), why they moved away from those systems (cost, efficiency, lack of visibility), and the new tools and systems they're using (such as Airpal, Presto, and Kafka). There's also insight into AWS-specific details, such as networking configuration (run in a single availability zone) and AWS instances (r3.8xlarge instances for Spark and d2.8xlarge instances for HDFS).

Cloudera has a post about using Cloudera Director, their framework for scaling Hadoop clusters in the cloud, as part of a data pipeline. Specifically, there are example scripts (using wget and jq) for starting a cluster, running a job, and terminating a cluster.

The AWS blog has a post introducing Spark Streaming with Amazon Kinesis. The article covers topics like building a Spark cluster using EMR, dynamic resourcing, failure recovery (checkpointing to DynamoDB), and more. There's lots of sample code included to help get started on a new project.
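
As a rough illustration of how checkpoint-based failure recovery works, the sketch below records the last processed sequence number per shard and resumes from it after a simulated crash. It does not use the Kinesis or DynamoDB APIs. `CheckpointStore` is a hypothetical in-memory stand-in for a durable table, and `consume` and its `fail_at` parameter are invented for the demonstration.

```python
class CheckpointStore:
    """Hypothetical in-memory stand-in for a durable checkpoint table."""

    def __init__(self):
        self._last = {}

    def get(self, shard_id):
        # -1 means nothing has been processed for this shard yet.
        return self._last.get(shard_id, -1)

    def put(self, shard_id, sequence_number):
        self._last[shard_id] = sequence_number


def consume(shard_id, records, store, handle, fail_at=None):
    """Process records after the stored checkpoint, checkpointing each one."""
    start = store.get(shard_id) + 1
    for sequence_number in range(start, len(records)):
        if sequence_number == fail_at:
            raise RuntimeError("simulated worker failure")
        handle(records[sequence_number])
        store.put(shard_id, sequence_number)


store = CheckpointStore()
records = ["a", "b", "c", "d"]
seen = []

try:
    consume("shard-0", records, store, seen.append, fail_at=2)
except RuntimeError:
    pass  # the worker died after processing "a" and "b"

# A replacement worker resumes from the checkpoint instead of from the start.
consume("shard-0", records, store, seen.append)
```

In this sketch a crash between `handle` and `store.put` would cause that record to be processed again on restart, which is why checkpointing schemes of this kind are usually described as at-least-once.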

News

MapR announced a few changes and additions to its management team this week. In the context of some of the changes (such as a new CMO), the post describes how MapR sees and positions itself in contrast to other vendors like Cloudera and Hortonworks.

This post summarizes what to expect in the forthcoming Spark 2.0. The two main features are (as mentioned in the technical posts above) a new stream processing framework and Tungsten for more efficient memory management. Another important part of the release is a set of housekeeping improvements, for example, cleaning up and consolidating APIs.

IBM has submitted Quarks to the Apache Incubator. Quarks is a new type of stream processing system built for the Internet of Things. In particular, it supports pushing computation to the "edge" components, such as embedded devices.

Releases

Yahoo has open sourced CaffeOnSpark, a framework for deep learning on Spark. CaffeOnSpark's APIs support DataFrames and provide a number of built-in data sources (such as LMDB and images stored in sequence files). Along with demoing the API, the introductory post shows how to run a CaffeOnSpark job using the spark-submit command-line tool.

Last week, I mistakenly stated that Apache Accumulo 1.7.0 was released. The new release was actually 1.6.5, and this week there's also a 1.7.1 release. Both are maintenance releases, resolving 55 and 150 issues, respectively. The 1.7.1 release has some important fixes, which are noted in the release note highlights.

Apache HBase 1.2.0 was released this week. The release adds support for JDK8, support for Hadoop 2.6.1+ and 2.7.1+, dynamic configuration reloads, and much more. In all, there are over 600 issues resolved as part of the release.