Data Eng Weekly

Hadoop Weekly Issue #109

22 February 2015

There was quite a bit of news this week with the announcement of the Open Data Platform, Pivotal open-sourcing several systems, and announcements related to Strata+Hadoop World. I've highlighted a few major announcements (there were too many to cover all in-depth), and I've also found a number of interesting technical articles covering Spark, Kafka, Cascalog, and more.

Technical

This post provides one of the best descriptions of a Data Lake that I've seen. It also talks about several common problems with, misconceptions of, and best practices for productionizing a data lake.

The O'Reilly Radar blog has a post describing several compute frameworks for Hadoop--everything from SQL to machine learning to real-time. The post describes the key considerations for choosing a framework and gives some guidance as to when to use each.

Apache Spark is adding a new DataFrames API, which is inspired by data frames in R and Pandas (Python). DataFrames are like tables in an RDBMS, but contain additional optimizations. In particular, materialization of DataFrames uses the Spark SQL optimizer and code generation framework. There are more details on the API, which is planned for Spark 1.3, in the introductory post.

Answers is a near real-time mobile app analytics system built by Crashlytics/Twitter. The Twitter blog has a post describing the architecture of the system, which ingests billions of events per second. The system implements the Lambda architecture, using Kafka as the messaging layer, Storm for the speed layer, and EMR with Cascading for batch computation.
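
For readers new to the Lambda architecture, here is a minimal sketch of the idea in plain Python. It is not taken from the Twitter post: the event shape, the function names, and the cutoff value are all assumptions made for illustration. A batch layer recomputes exact counts up to a cutoff, a speed layer covers whatever has arrived since, and queries merge the two views.

```python
from collections import Counter

# Hypothetical events: (app_id, timestamp) pairs. All names here are illustrative.
def batch_view(events, cutoff):
    """Batch layer: recompute counts for every event before the cutoff."""
    return Counter(app for app, ts in events if ts < cutoff)

def speed_view(events, cutoff):
    """Speed layer: count only events the batch layer has not covered yet."""
    return Counter(app for app, ts in events if ts >= cutoff)

def query(batch, speed, app):
    """Serving side: merge the batch and speed views at query time."""
    return batch[app] + speed[app]

events = [("app-a", 1), ("app-b", 2), ("app-a", 5), ("app-a", 9)]
cutoff = 5  # assume the last batch run processed everything with ts < 5
batch = batch_view(events, cutoff)
speed = speed_view(events, cutoff)
total_a = query(batch, speed, "app-a")
```

In a real deployment the two views would live in separate stores and the speed view would be discarded once the next batch run catches up; the sketch only shows why a query needs both.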

In last week's newsletter, there was mention of separating Spark from Hadoop. This week, Pinterest has written about just that--they're using Spark Streaming with MemSQL for real-time analytics. The prototype system uses Spark Streaming to take data from a Kafka topic, join it with dimensional data, and send the data to MemSQL.
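
The enrichment step in a pipeline like this (join each incoming event with dimensional data before writing it out) can be sketched in a few lines of plain Python. This is an illustration only, not Pinterest's code and not the Spark Streaming API: the field names (`pin_id`, `category`, `action`) and the in-memory dimension table are assumptions.

```python
def enrich(events, dimensions):
    """Join each event with its dimension row, yielding merged dicts.

    Events with no matching dimension row are dropped (an inner join).
    """
    for event in events:
        dim = dimensions.get(event["pin_id"])
        if dim is None:
            continue
        yield {**event, **dim}

# Hypothetical dimension table keyed by pin_id, and a small batch of events.
dimensions = {1: {"category": "food"}, 2: {"category": "travel"}}
events = [
    {"pin_id": 1, "action": "repin"},
    {"pin_id": 3, "action": "click"},  # no dimension row, so it is dropped
    {"pin_id": 2, "action": "click"},
]
rows = list(enrich(events, dimensions))
```

Whether unmatched events should be dropped or passed through with empty fields is a design choice; the sketch picks the inner-join behavior for simplicity.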

The MSDN blog has a post about tuning performance of Sqoop jobs on Azure HDInsight. The suggestions are mostly distribution-independent (e.g. tuning number of map tasks, sizing the cluster and db properly), so it's a useful read if you're working with Sqoop.
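
To see why the number of map tasks matters, it helps to picture how an import is divided: each map task pulls a slice of the split column's value range (in Sqoop, the relevant options are `--num-mappers` and `--split-by`). The sketch below is a simplified illustration in Python, not Sqoop's actual splitter, and the function name is made up for this example.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive integer range [lo, hi] into contiguous chunks,
    one per mapper, with chunk sizes differing by at most one."""
    total = hi - lo + 1
    num = min(num_mappers, total)  # never create more chunks than values
    base, extra = divmod(total, num)
    ranges = []
    start = lo
    for i in range(num):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

four_way = split_ranges(1, 100, 4)
uneven = split_ranges(0, 9, 3)
```

Each chunk becomes one query against the database, so more mappers means more parallel connections. If the split column's values are skewed, equal-width ranges hold unequal numbers of rows, which is one reason simply raising the mapper count does not always help.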

Netflix recently announced the Surus project, which is an open-source library of analysis tools for Pig and Hive. This week, they added the second function to the library: Robust Anomaly Detection (RAD). The Netflix blog has an overview of the goals of the tool, the algorithm it implements, and how it can be used via Apache Pig.

This presentation describes best practices for building a data architecture. It contains ideas like using Kafka as a data bus, directory layouts for datasets in HDFS, using Spark Streaming, and schema management. It offers lots of tips for building a reliable and consistent system.

Cascalog, the Clojure library for Cascading, has recently added support for custom Hadoop counters (on master). This post describes how to update counters as part of a Cascalog job and how to access the counters programmatically afterwards.

News

The Strata+Hadoop World conference was this week in San Jose. Videos of the keynotes and select interviews have been published on YouTube. Included in the list is a keynote by President Obama and the U.S. Chief Data Scientist, Dr. DJ Patil.

TechTarget has an overview of the benefits of a Hadoop-powered data lake. The article looks at Allstate and Solutionary Inc, who have both recently created data lakes. Example benefits include the ability to look at country-level data (at Allstate) for the first time and using large-scale machine learning to identify when home inspections aren't necessary for a homeowners insurance policy.

Hortonworks, Pivotal, IBM, GE, Verizon, and others announced the "Open Data Platform" (ODP) this week. The goal is to standardize Hadoop ecosystem components and versions to ease interoperability across distributions. Companies such as Cloudera, which didn't join the ODP, have responded negatively to the announcement. There have been a number of articles about this topic, but I find the Gartner blog has one of the best takes on both sides of the argument.

Related to the ODP announcement, Pivotal and Hortonworks announced that they'll be "aligning efforts around Hadoop." As part of this, customers can choose to use either Pivotal HD or the Hortonworks Data Platform, and Hortonworks will provide advanced support for enterprise customers of both distributions.

Pivotal made another announcement this week, which is easy to overlook given all the discussion around the Open Data Platform. The company is open-sourcing its Greenplum, HAWQ, and GemFire database products (and still offering licenses and support). Greenplum is the company's analytics data warehouse, HAWQ is the SQL engine for Hadoop, and GemFire is an in-memory distributed database.

Datanami reports that Hadoop's lack of enterprise security features, including fine-grained access control, is limiting and sometimes preventing enterprise adoption. The post mentions some companies that are selling products to add security features.

Databricks and Intel announced a partnership to optimize Spark for Intel architecture. Intel's work on core Hadoop helped bring encryption-at-rest and other important features to the platform, so it should be interesting to see what comes of this partnership.

This post provides a recap of several themes that emerged at this week's Strata+Hadoop World. These include continued infatuation with Spark, security for Kafka, and a discussion around Spark Streaming vs. Storm for stream processing.

Releases

IBM announced several new modules for their BigInsights distribution. These include BigInsights Analyst (for integrating spreadsheets and visualizations with their SQL-on-Hadoop engine), BigInsights Data Scientist (for machine-learning on large datasets), and BigInsights Statistical Management (for managing resources and optimizing workflows).

Cloudera announced that Apache Kafka has graduated from Cloudera Labs and is now fully supported as part of Cloudera Enterprise. A technical post on the Cloudera blog describes how to deploy Kafka using CDH and includes some guidance for choosing hardware and sizing a cluster. It also describes various details of the architecture, such as replication, partitioning, and how to guarantee message delivery.

Microsoft announced availability of HDP 2.2, which includes Apache Storm, as part of their Azure HDInsight Hadoop-as-a-Service platform. They also announced a preview of HDInsight on Linux, which uses Apache Ambari for deployment.

Hadoop-as-a-Service company Altiscale announced two new features this week. First, Apache Spark has been fully integrated into their platform. Second, they're now offering secure-mode for Hadoop using Kerberos.

MapR announced version 4.1 of their distribution. Key features include bi-directional data replication between MapR-DB clusters in separate data centers, a POSIX client for loading data into MapR FS, and a new C API for MapR-DB.

Cloudera has released version 1.1 of Cloudera Director, their tool for provisioning CDH clusters in AWS. This release includes support for dynamically resizing a cluster and integration with Amazon's RDS (database-as-a-service). The Cloudera blog has more details and enumerates features planned for the future.

Apache Gora is an in-memory data model and persistence framework for Apache HBase, Apache Cassandra, and several other data stores (both k/v and RDBMS). This week, version 0.6 was released. The release updates several of the dependencies (HBase, Avro, Hadoop, and more) that it supports.