Data Eng Weekly

Hadoop Weekly Issue #63

30 March 2014

Hortonworks announced a new round of funding this week, and Intel and Cloudera announced a major new partnership. There’s a lot of money being put into the Hadoop ecosystem, which is rapidly changing. Lots of articles this week cover the evolving set of frameworks, like Storm and Spark, that make up Hadoop data pipelines.

Technical

The Cloudera blog has a guest post about Spark Streaming from engineers at Sharethrough. The post walks through their migration from a batch-processing system using Scalding to a micro-batch system using Spark Streaming. The new architecture means that data is reflected in the system within seconds rather than an hour. The post goes into some of the technical details and lessons learned during their migration.
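For readers new to the term, the micro-batch idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Spark Streaming's API and not Sharethrough's code: the event format, the `micro_batches` and `count_per_batch` helpers, and the 5-second interval are all assumptions made up for this example.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, key) events into consecutive fixed-length windows."""
    batches = defaultdict(list)
    for timestamp, key in events:
        # Each event lands in the window whose start is the timestamp
        # rounded down to a multiple of the interval.
        window_start = int(timestamp // interval_seconds) * interval_seconds
        batches[window_start].append(key)
    return dict(sorted(batches.items()))


def count_per_batch(events, interval_seconds):
    """Count occurrences of each key within every micro-batch."""
    result = {}
    for window_start, keys in micro_batches(events, interval_seconds).items():
        counts = defaultdict(int)
        for key in keys:
            counts[key] += 1
        result[window_start] = dict(counts)
    return result


# Hypothetical click events: (seconds since start, ad id).
events = [
    (0.5, "ad-1"), (1.2, "ad-2"), (4.9, "ad-1"),
    (5.1, "ad-1"), (9.7, "ad-2"), (12.0, "ad-1"),
]
counts = count_per_batch(events, interval_seconds=5)
print(counts)
```

Each window's counts are available as soon as that window closes, which is why a micro-batch system can reflect new data within seconds, whereas an hourly batch job only produces results after the whole hour has been collected.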

Spotify has introduced Storm into their backend to complement batch processing with Hadoop. This talk gives some insight into their deployment, how it fits into their data pipeline, details on some of the features they’re powering with Storm, and more.

The Hortonworks blog has a post on the recently released Apache Storm 0.9.1-incubating. The article details the new Netty-based messaging transport, added support for Windows, and a switch to Maven for builds. It also covers what to expect in the next release of Storm.

Cloudera has integrated Apache Sentry, the fine-grained authorization system for Hadoop, with Cloudera Search, the Apache Solr integration with CDH. A post on the Cloudera blog details the authorization and authentication layers in Cloudera Search as well as how secure impersonation is done from Hue.

The MapR blog has a post about several key terms related to big data. It covers the difference between data stream management systems (DSMS) and database management systems (DBMS), batch processing vs interactive mode, and real-time vs low-latency. It also talks about the very overloaded ‘streaming’ term in the Hadoop ecosystem.

Hortonworks is planning to ship Apache Falcon (incubating), which is a data management and governance system for Hadoop, with HDP 2.1. They’ve published a post describing what Falcon does in depth. It also includes tutorials on building example pipelines (using Pig) and implementing cross-cluster replication.

Packt has a post from the authors of the book “Storm Blueprints: Patterns for Distributed Real-time Computation” on running Storm on YARN. The post gives a brief overview of Hadoop, focusing on how Storm complements MapReduce for real-time processing. It then talks about the architecture of Storm on YARN.

The Databricks blog has a post describing a new feature recently added to Apache Spark called Spark SQL. Whereas Shark uses Spark as a backend to Hive, Spark SQL provides a mechanism to invoke distributed SQL from a Spark job and perform additional processing on the data using Spark’s RDDs. It also enables persisting of Spark RDDs to Hive. The post has a detailed overview of the system and its optimization framework called Catalyst.

Doing anything interesting with large data tends to be a tall task. In addition to compute horsepower, there is a lot of infrastructure required to do something non-trivial. A post on Datanami explores the under-respected task of data cleaning, which often ends up being a large part of the data pipeline. The article includes interviews with some folks in industry about the importance of scrubbed data.

Answering the question “What is Hadoop?” is becoming increasingly difficult (it’s a question I ask every week as I’m evaluating articles for this newsletter). The Gartner blog has a post exploring this topic, including how each vendor has taken on a different set of components for its distribution.

Hortonworks announced their Series D round of funding, which totals $100 million. In addition to capital from existing investors, the new round was led by BlackRock and Passport Capital. Hortonworks says that they'll be using the money to scale engineering efforts, global operations, and their ecosystem.

There’s been a lot of discussion on the Apache Mahout mailing list about the future of the project. GigaOm has an article summarizing the outcome of the discussion: the Mahout community has decided to support Apache Spark and the H2O framework rather than MapReduce.

On the heels of $160M in funding announced last week, Cloudera and Intel announced a deal in which Intel is investing a rumored $90M+. In addition to the investment, Intel is dropping their own distribution and will work with Cloudera on CDH. The Cloudera blog has detailed their thoughts on the partnership, and SiliconANGLE has commentary about the industry-wide implications of the deal.