Data Eng Weekly

Hadoop Weekly Issue #167

25 April 2016

Welcome to a special Monday edition of Hadoop Weekly. There's lots of great technical content this week from Spark to Kafka to Beam to Kudu. If you're looking for something even more bleeding edge than some of those technologies, Apache Metron (incubating) had its first release. Metron, which is a general-purpose security system built on Hadoop, is a project to keep an eye on going forward.

Technical

This presentation serves as a guide to building a stream processing system in AWS. It describes relatively simple solutions such as Amazon Kinesis with AWS Lambda and the Kinesis S3 connector, as well as more complex solutions for real-time analytics that combine many AWS services.

This post describes how to use Spark Testing Base, which is a testing framework for Spark written in Scala, from Java. The example code shows how to refactor Spark code to isolate the logic to test as well as how to deal with some of the gnarly Scala APIs from Java.

LinkedIn has posted about their Kafka ecosystem, which includes a special Kafka producer, a REST API for non-Java clients, monitoring, an Avro schema registry, Gobblin (a tool for loading data into Hadoop), and more.

Apache Kudu (incubating) is an exciting companion to Apache Impala (incubating) because it can efficiently answer both broad analytics and very targeted queries. This post describes the technical details of the integration, how Kudu's design provides efficient querying capabilities, how to perform write/update/delete operations with Impala and Kudu, and more.

MapR has a post about using spark-sklearn to scale out an existing scikit-learn model. It walks through building a model from the Inside Airbnb dataset and describes how to plug in spark-sklearn for cross-validation.

The AWS big data blog has a tutorial describing how to use HBase and Hive with Amazon EMR. The post includes an introduction to HBase, describes how to restore an HBase table from S3, demonstrates Hive and HBase integration, and more.

This post describes some of the challenges in providing real-world experience to students taking a big data course. The author has gone through several iterations and options and seems to have finally landed on a good solution: Altiscale's Hadoop-as-a-Service.

The Cloudera blog has a guest post in which the author compares Parquet and Avro across two data sets: a narrow one (3 columns) and a wide one (103 columns). Using test queries and operations in Spark and Spark SQL, the author finds that queries against Parquet and Avro serialized data sometimes perform similarly, but that in many cases queries against Parquet data are much faster (and the serialized Parquet data is much smaller).
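
One intuition for why a columnar format can win on wide data is that a query touching a single column only has to read that column's bytes, whereas a row-oriented format has to read every record in full. The sketch below is a toy illustration of that idea only: it uses JSON as a stand-in for both layouts, it is not the Parquet or Avro format, it ignores compression and encoding, and all function names are made up for this example.

```python
import json


def make_rows(n_rows, n_cols):
    """Build n_rows records with n_cols integer fields named c0, c1, ..."""
    return [{f"c{j}": i * n_cols + j for j in range(n_cols)} for i in range(n_rows)]


def row_layout(rows):
    """Row-oriented stand-in: one serialized blob per record."""
    return [json.dumps(r).encode() for r in rows]


def column_layout(rows):
    """Column-oriented stand-in: one serialized blob per column."""
    return {
        name: json.dumps([r[name] for r in rows]).encode()
        for name in rows[0]
    }


def scan_row_layout(blobs, column):
    """Read one column from the row layout; every record must be read in full."""
    values, bytes_read = [], 0
    for blob in blobs:
        bytes_read += len(blob)
        values.append(json.loads(blob)[column])
    return values, bytes_read


def scan_column_layout(columns, column):
    """Read one column from the column layout; only that column's blob is read."""
    blob = columns[column]
    return json.loads(blob), len(blob)


if __name__ == "__main__":
    # Widths mirror the narrow (3 column) and wide (103 column) data sets.
    for n_cols in (3, 103):
        rows = make_rows(1000, n_cols)
        _, row_bytes = scan_row_layout(row_layout(rows), "c0")
        _, col_bytes = scan_column_layout(column_layout(rows), "c0")
        print(n_cols, "columns:", row_bytes, "bytes (row) vs", col_bytes, "bytes (column)")
```

In this toy model the gap between the two layouts grows with the number of columns, which is in line with the wide data set being where the formats differ most. Real results depend on encoding, compression, and the query engine, so the numbers printed here should not be read as a benchmark.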

This article describes how to use SparkR with a distribution, like CDH, that doesn't officially support it. By leveraging YARN and locally installed R packages on the workers, jobs can be executed with little additional work.

There have been a number of open-source frameworks to execute MapReduce and similar jobs with a higher-level programming model. Historically, these have been tied to individual execution frameworks (e.g. MapReduce, Storm), but there's recently been work to make them agnostic. Apache Beam (incubating) aims to take that even further, generalizing across execution models for both batch and streaming and offering built-in support for complex compute models.

The Apache blog has a 7-part series presenting experimental results for HBase write throughput across HDD, SSD, and RAMDISK. In performing the analysis, the authors uncovered a few issues in HBase and HDFS and proposed fixes for them.

News

Tom White, the author of "Hadoop: The Definitive Guide," has written about how he became involved in Apache Hadoop. His early contributions were around integrating Hadoop with Amazon Web Services, which has been an important part of the project's success.

Releases

Apache Metron, a security framework built on Hadoop, has released version 0.1. Hortonworks is supporting it as a tech preview, and has written about the features, how to get started, how to contribute, how to use the Metron UI, and more.