Data Eng Weekly

Hadoop Weekly Issue #144

08 November 2015

After skipping last week, this issue has a lot of content. Notably, there have been a bunch of releases over the past two weeks—Hadoop, Tajo, Phoenix, Slider, Apex, and Storm. In news, Hortonworks announced quarterly results, and there's a new free eBook "Hadoop with Python." Technical content includes tutorials (Apex and Kudu+Impala) and internals (Kafka and Phoenix).

Technical

The DataTorrent blog has a tutorial for writing an Apache Apex application in Scala. The tutorial shows how to set up a Maven project, write a LineReader, Parser, and Application, and run the application with dtcli.

The Confluent blog has a post describing how Kafka implements "request purgatory," which tracks requests that haven't yet succeeded or encountered an error. The original implementation uses Java's DelayQueue, which shares performance characteristics with a priority queue. The new design uses Hierarchical Timing Wheels, which offer faster, tunable performance characteristics. The post describes the implementation in detail and gives an overview of performance benchmarks comparing the old and new designs.
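To make the data structure concrete, here is a minimal sketch of a hierarchical timing wheel in Python. It is not Kafka's code: the class name, method names, and parameters are hypothetical, and the sketch advances the clock one tick at a time for simplicity, where a production implementation would avoid walking empty buckets. Each wheel covers tick_ms * wheel_size of time, and anything further out is handed to a coarser overflow wheel and cascaded back down as its time approaches.

```python
class TimingWheel:
    """Illustrative hierarchical timing wheel (hypothetical names, not Kafka's API)."""

    def __init__(self, tick_ms, wheel_size, start_ms=0):
        self.tick_ms = tick_ms
        self.wheel_size = wheel_size
        self.interval_ms = tick_ms * wheel_size        # time span this wheel covers
        self.current_ms = start_ms - start_ms % tick_ms  # clock rounded down to a tick
        self.buckets = [[] for _ in range(wheel_size)]
        self.overflow = None                            # coarser wheel, created lazily

    def add(self, expiration_ms, task):
        """Schedule a task. Returns False if it is already due at tick granularity."""
        if expiration_ms < self.current_ms + self.tick_ms:
            return False
        if expiration_ms < self.current_ms + self.interval_ms:
            # O(1) insert: the bucket is chosen by simple arithmetic, not a heap.
            slot = (expiration_ms // self.tick_ms) % self.wheel_size
            self.buckets[slot].append((expiration_ms, task))
            return True
        if self.overflow is None:
            # One tick of the overflow wheel spans this wheel's whole interval.
            self.overflow = TimingWheel(self.interval_ms, self.wheel_size, self.current_ms)
        return self.overflow.add(expiration_ms, task)

    def advance(self, now_ms):
        """Move the clock to now_ms and return the (expiration_ms, task) pairs that came due."""
        due = []
        while self.current_ms + self.tick_ms <= now_ms:
            self.current_ms += self.tick_ms
            slot = (self.current_ms // self.tick_ms) % self.wheel_size
            due.extend(self.buckets[slot])
            self.buckets[slot] = []
            if self.overflow is not None:
                # Cascade: entries leaving the coarser wheel are re-inserted here,
                # or reported as due if they fall within the current tick.
                for expiration_ms, task in self.overflow.advance(self.current_ms):
                    if not self.add(expiration_ms, task):
                        due.append((expiration_ms, task))
        return due


if __name__ == "__main__":
    wheel = TimingWheel(tick_ms=1, wheel_size=4)
    wheel.add(2, "fetch-request")     # fits in the first wheel
    wheel.add(9, "produce-request")   # lands in the overflow wheel
    print(wheel.advance(2))
    print(wheel.advance(9))
```

The tunable part is the choice of tick_ms and wheel_size: a finer tick gives more precise timeouts at the cost of more buckets to walk.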

Hortonworks has a post describing the components and features of Spark that they've worked on in the past year, and where they're concentrating effort for the future. Past work includes ORC support, an Ambari stack definition for Spark, machine learning library improvements, and documentation updates. Future work includes maturing Apache Zeppelin, an entity disambiguation library, a new Spark + HBase integration, the ability to persist RDDs to HDFS's memory tier, and making Spark Streaming more robust.

The recently released Apache Phoenix 4.6 includes support for declaring ROW_TIMESTAMP as part of a table's primary key. By doing so, the value is stored using HBase's native row timestamp, which provides performance gains. In particular, when scanning regions with HFiles that haven't been compacted, the ROW_TIMESTAMP information can be used to skip entire files, which is especially handy when reading recently-written data. The introductory blog post describes the optimization in more detail and shows example query response times with and without the feature enabled.
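The file-skipping idea can be illustrated with a toy model in Python. This is not Phoenix or HBase code: StoreFile and scan are hypothetical names, and the sketch simply assumes each file records the minimum and maximum timestamp of the rows it holds, so that a time-bounded scan can rule out a file without opening it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StoreFile:
    """Toy stand-in for an on-disk file plus its timestamp-range metadata."""
    name: str
    min_ts: int
    max_ts: int
    rows: List[Tuple[int, str]]  # (timestamp, value)


def scan(files, start_ts, end_ts):
    """Return (values with start_ts <= ts < end_ts, names of files actually read)."""
    values, files_read = [], []
    for f in files:
        # If the file's time range cannot overlap the query, skip the whole file.
        if f.max_ts < start_ts or f.min_ts >= end_ts:
            continue
        files_read.append(f.name)
        values.extend(v for ts, v in f.rows if start_ts <= ts < end_ts)
    return values, files_read


if __name__ == "__main__":
    files = [
        StoreFile("old", 100, 199, [(100, "a"), (199, "b")]),
        StoreFile("mid", 200, 299, [(250, "c")]),
        StoreFile("new", 300, 399, [(310, "d"), (390, "e")]),
    ]
    # A query for recent data only touches the newest file.
    print(scan(files, 300, 400))
```

The fewer compactions have run, the more small files there are, and the more a range check like this can rule out.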

Kudu, the new storage engine from Cloudera, integrates with Impala for SQL access. This post describes how to set up Impala with Kudu (this currently requires a custom build of Impala), how to tell Impala about data stored in Kudu, how to perform various SQL operations (both read and write/update queries), and more.

This post describes the types of RDD persistence available in Spark. The default is memory-only, which is fast but can lead to OutOfMemoryErrors. The post has a brief overview of the performance characteristics and trade-offs of several other options.

This tutorial describes how to use Apache Ambari to install and configure the Tachyon FileSystem, which is a memory-centric distributed storage system. The post also has a brief example of using TachyonFS from Spark.

Depending on data sizes and distributions, an inner join in MapReduce can be performed efficiently in a few different ways. This post describes, at a high level, several of the strategies for implementing an inner join with MapReduce. For each (e.g. reduce-side, map-side), the post describes some of the relevant Hadoop APIs.
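As a rough illustration of the two strategies named above, here is a plain-Python sketch that mimics the data flow without using any Hadoop APIs. The function names and sample records are hypothetical. The reduce-side version groups both inputs by key (standing in for the shuffle) and pairs them up per key; the map-side version assumes one input is small enough to hold in memory as a lookup table, so no shuffle is needed.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right):
    """Group both inputs by key (the 'shuffle'), then pair records per key (the 'reduce')."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)   # records tagged as coming from the left input
    for key, value in right:
        grouped[key][1].append(value)   # records tagged as coming from the right input
    joined = []
    for key in sorted(grouped):
        left_values, right_values = grouped[key]
        # Keys missing from either side yield an empty product: inner-join semantics.
        joined.extend((key, lv, rv) for lv, rv in product(left_values, right_values))
    return joined


def map_side_join(big, small):
    """Load the small input into memory, then stream the big input past it."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, bv, sv) for key, bv in big for sv in lookup.get(key, [])]


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob"), (3, "cy")]
    orders = [(1, "book"), (1, "pen"), (3, "cup"), (4, "hat")]
    print(reduce_side_join(users, orders))
    print(map_side_join(users, orders))
```

Which one is preferable depends on whether either input fits in memory on every mapper; when neither does, the shuffle-based approach is the general fallback.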

Myriad is a system for running YARN atop a Mesos cluster. This post looks at how to use Docker's overlay network plugin to isolate YARN clusters (with the ResourceManager and NodeManager running inside Docker). All clusters share a common distributed file system, which can be accessed via another network bridge. The post has many more details, along with code (including Dockerfiles and scripts) for implementing the solution.

News

Hortonworks announced quarterly results this week. They reported a loss of $0.74/share (adjusted) on $33.1 million in revenue, both of which beat the average analyst estimate (of those surveyed by Zacks Investment Research).

Releases

Apache Phoenix 4.6, the SQL framework for HBase 0.98, 1.0, and 1.1, was released. The new release includes support for HBase native timestamps, a correlation variable, an alpha-version of a web-app for viewing trace information, and more.

Apache Tajo, the SQL-on-Hadoop data warehousing system, released version 0.11.0. The new release adds support for nested record types, ORC files, Python UDF/UDAF, tablespaces, and multi-queries. The release also includes improved performance for the JDBC drivers, joins, and more.

Apache Slider 0.81.1-incubating was released. Slider is a framework and application for deploying existing distributed systems on YARN. The new release fixes several bugs and contains a few new features/improvements.

Apache Apex has released version 3.2.0-incubating, its first release since joining the Apache Incubator. Apex is a data processing system for streaming and batch workloads, and the new release contains many patches atop the 3.1.0 release.

Apache Storm 0.10.0 has been released. In beta since June, this major new version adds support for secure multi-tenant deployments, Flux (a new framework for defining Storm topologies), an improved logging framework, streaming ingest to Hive, and more.