Data Eng Weekly

Hadoop Weekly Issue #96

16 November 2014

Big news this week out of Palo Alto as Hortonworks has filed paperwork for an initial public offering. There were also a number of notable releases this week, including Apache Hive 0.14.0. Technical posts cover a large number of ecosystem topics, including Apache Sqoop, Apache Drill, and Apache Pig. There’s a lot of breadth in this issue, so there should be something for everyone!

Technical

The Cloudera blog has a guest post from Cerner about integrating Apache Kafka with HBase and Storm for real-time processing. The post describes how adopting Kafka helped reduce load on HBase (which was previously used for queuing) and improve performance. This style of Kafka-based architecture seems to be more and more common, but it’s always interesting to hear how folks are putting together the pieces of the Hadoop ecosystem.

The MapR blog has a post on using the recently-released Apache Drill 0.6.0-incubating to analyze Yelp’s public data set. The data, which is a JSON file, can be queried directly via SQL in Drill without first declaring the data’s schema (Drill auto-detects it). The post has a number of sample queries which you can use to get started analyzing this or any other data set.
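To make the schema-free idea concrete, here is a minimal Python sketch of discovering fields while reading JSON records and then running a simple group-by over them. It is not Drill's implementation and does not use Drill's API; the records, field names, and helper functions are all hypothetical and are not taken from the Yelp data set.

```python
import json
from collections import Counter

# Hypothetical newline-delimited JSON records; the field names are invented
# for illustration and are not taken from the Yelp data set.
RAW = """\
{"name": "Cafe A", "city": "Phoenix", "stars": 4.5}
{"name": "Diner B", "city": "Phoenix", "stars": 3.0, "open": true}
{"name": "Bar C", "city": "Las Vegas", "stars": 4.0}
"""


def read_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def discover_schema(records):
    """Collect every field name seen and the Python type names observed for it."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def count_by(records, field):
    """Rough analogue of SELECT field, COUNT(*) ... GROUP BY field."""
    return Counter(r[field] for r in records if field in r)


records = read_records(RAW)
schema = discover_schema(records)   # fields are found while reading, not declared
counts = count_by(records, "city")  # records per city
```

Note that the "open" field appears in only one record, yet it still shows up in the discovered schema. Tolerating that kind of irregularity is what makes querying raw JSON without an up-front declaration convenient.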

The Cloudera blog has a second guest post, this time from Dell, on the new Oracle direct-mode in Sqoop 1.4.5. The post describes several of the implemented optimizations in the Oracle direct mode and includes an analysis of performance improvements the connector provides.

The Hortonworks blog has a post on using Apache Pig with the Python Scikit-learn package in order to predict flight delays using logistic regression and random forests. The post is a bit light on details, but there is a linked IPython notebook which has a very detailed overview and description of the entire process. Given that Python is often a data scientist’s top choice for machine learning on small data sets, it’s useful to see how to extend it to larger data sets with Pig.

The ingest.tips blog has a post on Sqoop1 support for Parquet, which leverages the Kite SDK to generate Parquet files during import. The post serves as a good introduction to Sqoop1, which can both import data to HDFS and update the Hive metastore with information about the data. There are examples demonstrating how to use Parquet support.

Tephra is an open-source system that provides globally-consistent transactions for Apache HBase. Cask, the makers of Tephra, have written a blog post describing the requirements and design of Tephra. Tephra is designed in such a way that it can be used with systems other than HBase, and it is even designed to support transactions spanning multiple data stores.

This presentation focuses on Spark Streaming, the micro-batch component of Apache Spark. The slides give an introduction to both Spark and Spark Streaming, describe several use cases (claiming there are 40+ known production use cases), give an overview of several integrations (Cassandra, Kafka, Elasticsearch, and more), and look ahead to some upcoming features and improvements in the development pipeline.
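For readers new to the term, micro-batching means cutting a continuous stream into small fixed-length windows and running an ordinary batch computation on each one. The following is a minimal plain-Python sketch of that idea only. It does not use the Spark API, and the event data, the one-second interval, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into fixed-length windows.

    Returns a list of (window_start_ms, [values]) sorted by window start.
    """
    windows = defaultdict(list)
    for timestamp_ms, value in events:
        # Integer division maps each timestamp to the start of its window.
        window_start = (timestamp_ms // interval_ms) * interval_ms
        windows[window_start].append(value)
    return sorted(windows.items())


def count_words_per_batch(events, interval_ms):
    """Run a small batch job (a word count) independently on each window."""
    return [(start, Counter(values)) for start, values in micro_batches(events, interval_ms)]


# Hypothetical stream of (milliseconds since start, word) events.
events = [(200, "a"), (900, "b"), (1100, "a"), (2500, "c"), (2700, "a")]

batches = micro_batches(events, 1000)
per_batch_counts = count_words_per_batch(events, 1000)
```

With a one-second interval the five events fall into three batches, and each batch is counted on its own. A real streaming engine does the same grouping continuously as data arrives rather than over a finished list.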

News

Hortonworks has filed paperwork for their initial public offering this week. The filing includes a number of details on the company, including financial numbers ($33.4M in revenue so far in 2014), an overview of key company milestones, and number of employees (524 at the end of September). GigaOm has an analysis of some of these numbers and an overview of what the IPO means for the rest of the industry.

IBM’s Big Data for Social Good Challenge opened this week. The challenge includes $40k in prizes, which will be awarded by a panel composed of IBM and industry experts. IBM has a curated list of datasets which can be used as part of a challenge entry.

Cubert is a new open-source tool from LinkedIn for writing high-performance MapReduce jobs. It consists of a new language on the same level as Pig or Hive (sharing some resemblance to Pig) as well as a novel storage format/layer called blocks. For statistical calculations, graph computations, and OLAP cubes, Cubert offers impressive performance improvements. There’s a lot more information in the introductory blog post.

Apache Hive 0.14.0 was released this week. The release resolves over 1,000 (!) Jira issues. I’m sure we’ll soon hear more details about the release in blog post form, but some quick highlights include: support for insert/update/delete with ACID support, a cost-based optimizer, support for data stored in Accumulo, support for HBase snapshots, and many improvements to ORCFile and HiveServer2.

Microsoft released version 2.5 of the Azure SDK and a preview of Visual Studio 2015. The releases contain support for HDInsight (the Hadoop as a Service component of Azure) including a Hive query editor and job viewer.