Data Eng Weekly

Hadoop Weekly Issue #17

12 May 2013

This week's newsletter is a little lighter than normal in technical news (some fascinating articles, though!), but there are quite a few interesting releases and upcoming events. Hope you enjoy, and please let me know if you find anything that I missed! Also, thanks to everyone who has been spreading the word about this newsletter -- the number of new subscribers each week has been really encouraging.

Technical

LinkedIn has open-sourced a number of big data projects built on or to coexist with Hadoop. In celebration of LinkedIn's 10th anniversary, this post covers 10 of those projects (such as Voldemort and DataFu), including a brief overview of each.

Following the release of version 0.2.0 of the Cloudera Development Kit, the Cloudera blog has a new post with an overview of the project, an FAQ, and a list of future plans. They plan on having monthly releases and focusing on documentation in addition to software libraries. It should be an interesting project to watch.

An overview of eBay's data warehouse, which ingests as much as 100TB/day and stores over 90PB. To power internal analytics, they use a combination of Hadoop, Teradata, and a custom-built system as data stores, plus front-end tools like Tableau, Excel, and MicroStrategy.

RCFile is a columnar format that's part of the Hive project. This post describes the motivation for RCFile as well as the benefits. In a follow-up post, the author will talk about the successor to RCFile -- ORCFile, which has similar features to the Parquet format.
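To make the columnar idea concrete, here's a toy sketch (not RCFile's actual on-disk layout, and the records are made up) showing why pivoting rows into columns helps analytic scans: an aggregate over one column reads a single contiguous array instead of deserializing every field of every row.

```python
# Hypothetical row-oriented records, purely for illustration.
rows = [
    {"user": "alice", "page": "/home",   "ms": 120},
    {"user": "bob",   "page": "/search", "ms": 340},
    {"user": "carol", "page": "/home",   "ms": 95},
]

def to_columns(rows):
    """Pivot row-oriented records into column-oriented arrays."""
    return {key: [row[key] for row in rows] for key in rows[0]}

cols = to_columns(rows)

# A query like AVG(ms) touches only the "ms" column...
avg_ms = sum(cols["ms"]) / len(cols["ms"])
# ...leaving the "user" and "page" data untouched on disk.
```

Formats like RCFile add row groups and per-column compression on top of this basic layout, so a scan of one column skips both the other columns' bytes and their decompression cost.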

An honest review of a new Hadoop book, the "Hadoop Beginner's Guide," covers both the good and the bad. Overall, the review is positive, but it notes a few technical issues that could be improved.

Russell Jurney, the author of Agile Data, has posted slides to accompany his book. The slides cover a number of principles for Agile big data as well as a bunch of example code covering everything from data analysis with Pig to visualization with Bootstrap and D3.

When an HBase RegionServer fails, it can take a few seconds or minutes for the regions owned by that RegionServer to recover. Reducing this time, known as the Mean Time to Recover (MTTR), has been the subject of a lot of work on both the HBase and HDFS projects. This post has a good overview of the technical challenges and their solutions.

Releases

Snakebite is a new Python project that uses protocol buffers to talk to HDFS without going through the JVM. In addition to an API, it supplies a command-line utility with similar functionality to hadoop fs (without the JVM startup overhead, it's a lot faster) and a script to start up a mini HDFS cluster, which it uses for testing.

Azkaban 2.1 was released. This is the first point release of the Hadoop workflow management software since the version 2 rewrite. It has a bunch of new features, like JMX support, auto-retries, and SLA-aware notifications.

Apache Giraph, the computation framework for bulk synchronous parallel programming (often used for network graph algorithms), had its version 1.0 release, the first since graduating from the Apache Incubator. This release has a bunch of features, including support for running within YARN, support for accessing Hive tables, and improved performance and memory efficiency.
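For readers new to the bulk synchronous parallel model, here's a minimal sketch of the idea behind it -- vertices exchange messages, then all wait at a barrier before the next superstep. This is a toy in Python, not Giraph's Java API; the single-source shortest-paths computation is a classic example of the style.

```python
def bsp_shortest_paths(edges, source):
    """Single-source shortest paths, BSP-style.

    edges: {vertex: [(neighbor, weight), ...]}
    Each loop iteration is one superstep: every vertex with
    pending messages processes them, then sends new messages,
    and the whole graph synchronizes before the next round.
    """
    dist = {v: float("inf") for v in edges}
    inbox = {source: [0]}  # messages delivered this superstep
    while inbox:           # halt when no vertex receives a message
        outbox = {}
        for vertex, msgs in inbox.items():
            best = min(msgs)
            if best < dist[vertex]:
                dist[vertex] = best
                # Propagate improved distances to neighbors; they
                # are delivered only after the superstep barrier.
                for neighbor, weight in edges[vertex]:
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox  # barrier: swap in next superstep's messages
    return dist
```

In Giraph the same structure appears as a per-vertex compute() method plus framework-managed messaging and barriers, which is what lets the computation scale across a cluster.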

Apache Curator (incubating) is a set of Java libraries for Apache ZooKeeper. The project was originally started at, and open sourced by, Netflix. This week it had its 2.0.0-incubating release, the first since joining the Apache Incubator.