Data Eng Weekly

Hadoop Weekly Issue #42

03 November 2013

It was great to have folks in NYC for Strata/Hadoop World this week -- I had a lot of interesting conversations. As expected, there was a ton of industry news this week as the graph of Hadoop ecosystem vendors became even more connected with many new partnerships and products announced. In addition, there was a deluge of technical content from dozens of talks at the conference and meetups surrounding it. Rather than trying to summarize them, I've provided links to the O'Reilly site containing slide decks. I tried to capture as much as I could, but there's inevitably something that I overlooked given the volume of news. Please email or tweet @joecrobak if I missed something important. Thanks!

Technical

Hadapt has announced a new feature in their SQL-on-Hadoop product that provides support for raw JSON data without a schema. The Hadapt system parses the data and builds a schema from the JSON structure as it's loaded. I've seen this approach before with Hive, but I haven't seen it with any of the other SQL-on-Hadoop vendors.
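To make the idea of building a schema from raw JSON at load time concrete, here is a minimal sketch in Python. It is not Hadapt's implementation, which isn't described in the announcement. The function names (`infer_type`, `flatten`, `infer_schema`), the dotted column paths, and the type labels are all assumptions chosen for illustration.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a SQL-like type label (labels are illustrative)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    raise TypeError(f"unsupported JSON value: {value!r}")


def flatten(record, prefix=""):
    """Yield (dotted_path, type_label) pairs for every non-object field."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects become dotted column names, e.g. "user.id".
            yield from flatten(value, prefix=path + ".")
        else:
            yield path, infer_type(value)


def infer_schema(lines):
    """Scan newline-delimited JSON and collect the set of types seen per column."""
    schema = {}
    for line in lines:
        for path, type_label in flatten(json.loads(line)):
            schema.setdefault(path, set()).add(type_label)
    return schema


records = [
    '{"user": {"id": 1, "name": "a"}, "score": 1.5}',
    '{"user": {"id": 2}, "score": 3, "tags": ["x"]}',
]
schema = infer_schema(records)
```

A column that ends up with more than one type label (such as `score` above) is where a real system would need a policy, for example widening to the more general type.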

WibiData founder Aaron Kimball has a post on the highlights from the Hadoop ecosystem over 2013. He covers the advances in SQL/NoSQL, data formats, and more. He also offers some predictions about what's in store for 2014.

Episode 17 of the All Things Hadoop podcast featured an interview with Apache Drill developer Jacques Nadeau. Apache Drill is a low-latency query engine with goals similar to those of other SQL-on-Hadoop engines. It's based upon the Google Dremel architecture, although it has some extensions such as support for pluggable data formats.

The Hortonworks blog has an update on the work they've been doing to improve the latency and throughput of Apache Hive. They have some results with a trunk version of Hive running on Apache Tez across the TPC-DS and Berkeley AMPLab Big Data Benchmark. The results are impressive, although the AMPLab query results aren't particularly useful given that the number of machines, disks, and RAM varied significantly between evaluations. With that said, I think it's great to see the improvements in Hive, which Hortonworks plans to ship in beta sometime in Q4.

MADlib is a library for in-database analytics, such as training SVM and linear regression models. MADlib, which already has implementations for Postgres and Greenplum, has been ported to run inside of Impala via the User Defined Aggregates functionality added in the Impala 1.2 beta. A post on the Cloudera blog covers details about the algorithms (which use a shared-nothing architecture for scalability) and detailed examples of running MADlib through Impala.
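As a rough illustration of why aggregates suit a shared-nothing design, the sketch below fits a one-variable linear regression with the init/update/merge/finalize pattern that user-defined aggregates commonly follow. It is written in Python for readability and is not MADlib's or Impala's actual API. The `RegressionState` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegressionState:
    """Running sums that are sufficient to fit y = slope * x + intercept."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xy: float = 0.0
    sum_xx: float = 0.0


def init():
    """Create an empty state; each node would start from one of these."""
    return RegressionState()


def update(state, x, y):
    """Fold one row into a node-local state."""
    state.n += 1
    state.sum_x += x
    state.sum_y += y
    state.sum_xy += x * y
    state.sum_xx += x * x
    return state


def merge(a, b):
    """Combine two partial states; no node needs to see another node's rows."""
    return RegressionState(
        a.n + b.n,
        a.sum_x + b.sum_x,
        a.sum_y + b.sum_y,
        a.sum_xy + b.sum_xy,
        a.sum_xx + b.sum_xx,
    )


def finalize(state):
    """Turn the merged sums into (slope, intercept), or None if undetermined."""
    denom = state.n * state.sum_xx - state.sum_x ** 2
    if denom == 0:
        return None
    slope = (state.n * state.sum_xy - state.sum_x * state.sum_y) / denom
    intercept = (state.sum_y - slope * state.sum_x) / state.n
    return slope, intercept


# Two "nodes" each aggregate their own partition of rows drawn from y = 2x + 1.
partition_a = [(0, 1), (1, 3), (2, 5)]
partition_b = [(3, 7), (4, 9), (5, 11)]

state_a = init()
for x, y in partition_a:
    update(state_a, x, y)

state_b = init()
for x, y in partition_b:
    update(state_b, x, y)

slope, intercept = finalize(merge(state_a, state_b))
```

Only the five running sums cross the network in the merge step, which is what lets this style of algorithm scale with the number of nodes.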

Steve Loughran of Hortonworks posted a thought-provoking piece about moving past Google as the inspiration for Hadoop ecosystem projects. His post is well-argued, citing the time it takes for a Google article to make it to press, the enormous scale that Google operates at (which is far ahead of everyone else), the plethora of other research work containing inspiration, and the lack of gritty details in Google's papers (such as operations). His post also argues that we're close to going full circle -- YARN has been cited in recent research papers. And hopefully soon Google will reference Hadoop-ecosystem projects in their papers, too.

Doug Cutting, the creator of Hadoop, spoke at StrataNY/Hadoop World about the future of Hadoop. The Cloudera blog contains some of his thoughts in the form of seven facts and predictions. Most of them are fairly uncontroversial, e.g. "Hardware gets cheaper" and "Data is valuable," but there are a few controversial predictions like "In the future, our data software platforms will be open-source." The full post is worth a thorough read given Doug's visionary open-source work.

The Intel Blog has a post on Apache HBase security from HBase committer Andrew Purtell. The post starts by covering the Kerberos-based security model that was introduced to HBase in 2009. Next, it talks about HBase data models for access control lists and cells (KeyValue). It then goes on to talk about how HBase cell-level security is implemented via cell tags (on target for HBase 0.98). The post wraps up with a thorough walkthrough of the implementation -- which uses HFile version 3 and coprocessors. It's a great read about the state of HBase and the upcoming security improvements.

The O'Reilly Strata blog has a post with observations from Strata NY/Hadoop World. It's one of the best recaps that I've seen of the event, including links to several of the talks it references. The author mentions six areas that seem to be abuzz -- SQL on Hadoop, BI tools for big data, approximation-based query speedups, startups doing machine learning for big data, realtime analytics, and hardcore data science.

ZDNet has an interview with Hortonworks cofounder and Hadoop committer Arun Murthy. Arun talks about the large shift in Hadoop precipitated by YARN, calling it "this datacenter Hadoop operating system." The article is a good overview of the history of Hadoop, the motivation for YARN, and Arun's vision of the future of YARN.

News

O'Reilly has posted videos of the keynotes from StrataNY/Hadoop World as well as some interviews on YouTube. There are talks by speakers from Cloudera, Facebook, Black Girls Code, Infochimps, and more. Some of the talks are sponsored and most aren't particularly technical, but they have some interesting high-level discussions about the industry.

Among their announcements at StrataNY/Hadoop World, Cloudera announced a new program called "Cloudera Connect: Innovators." The idea is to support new and innovative software as part of Cloudera's enterprise support offering. The first partner in the program is Databricks, the startup commercializing Apache Spark. Spark is a computation framework that complements MapReduce by caching large datasets in RAM. It sounds like Cloudera will create a Spark add-on to Cloudera Enterprise 5 (announced this week in beta).

Cloudera also announced a number of partnerships at Hadoop World. First, they have a new program called "Cloudera Connect: Cloud" for integrating Hadoop into the cloud. Inaugural partners include Verizon Enterprise Solutions, Savvis, SoftLayer, and T-Systems. They also plan to support Amazon Web Services and private cloud deployments in the future. Second, alongside the release of Cloudera Enterprise 5 beta, Cloudera announced/reiterated partnerships with companies supporting the new release. There are 25 companies on the list.

Rackspace announced that they're partnering with Hortonworks to bring HDP to the Rackspace public cloud and managed hosting environment. The service seems to be in beta now (you have to fill out a form to become a tester), and there's no word on whether it's based on HDP 1.x or HDP 2. In any case, this is one of the first solutions I've seen where you can outsource provisioning, deployment, and management of the Hadoop stack running on bare-metal hardware.

Hortonworks and Pivotal announced that Spring for Apache Hadoop has been certified to run on the Hortonworks distribution, HDP. Spring for Apache Hadoop is certified to run with HDP 1.3, but there's also support for Apache Hadoop 2.0.6 alpha, so I wouldn't be surprised if HDP 2 support was right around the corner.

Hortonworks announced that HP will resell HDP. It's interesting that HP has decided to partner with a vendor rather than build their own distribution (like Intel and IBM). If nothing else, the growth in the number of Hadoop distributions has slowed.

Microsoft announced the general availability of its Hadoop in the cloud offering, Windows Azure HDInsight. HDInsight supports .NET as well as Java, and they expect support for HDP 2 for Windows Server to come within the next month (HDInsight is built on HDP 1).

SAS and Cloudera announced an expanded partnership. In particular, they announced a new integration between SAS and Impala -- the SAS/ACCESS Interface to Impala. The announcement was made this week, but the software is expected in December.

Releases

Cloudera has released a beta of CDH 5 and Cloudera Manager 5 (together referred to as "Cloudera Enterprise"). The release is based upon Hadoop 2.2, HBase 0.95.2, Hive 0.11 (including much of Hive 0.12), and Oozie 4.0. It also includes Cloudera Search and Impala, which has been updated to include native UDFs, HDFS caching, and integration with YARN. The new release of Cloudera Manager also has a new extensibility feature to allow deployment of third-party applications via Cloudera Manager.

Apache Cassandra 2.0.2, a bug fix release, is now available. The release changes the speculative retry default to 99th percentile and adds configurable metrics reporting, persistence of compaction history stats to a table in Cassandra, a new consistency level (LOCAL_ONE), and more.

The Kiji BentoBox hit version 1.2.3 this week. It includes updates to nearly all components of the Kiji ecosystem. Most notably, the release includes a new Protobuf cell encoding (in addition to Avro) and support for freshening via KijiREST. It also includes various bug fixes and improvements.

Version 0.3.0 of Knox, the Apache Incubator project, was released. Knox is a REST Gateway to Hadoop clusters. This version includes LDAP authentication, support for Kerberos-enabled Hadoop clusters, HBase & Hive integration, and more. The Hortonworks blog has some more background and details.

Amazon Web Services announced that Elastic MapReduce now supports Apache Hadoop 2.2. It also supports HBase 0.94.7, Mahout 0.8, Hive 0.11, and Pig 0.11.1. AWS has also worked on speeding up cluster startup times (which are down an average of 60 seconds).