Data Eng Weekly

Hadoop Weekly Issue #61

16 March 2014

This issue of Hadoop Weekly is overflowing with top-notch technical articles. There’s coverage of several parts of the ecosystem, from Zookeeper to Oozie to YARN. In addition, Kafka, Zookeeper, and Tez saw releases this week, and new features of the Kafka and Tez releases were detailed in depth.

Technical

Episode 19 of the All Things Hadoop podcast has an interview with Adam Fuchs, Apache Accumulo PMC member and committer. The podcast covers the Accumulo data model, implementation, client-server architecture and more.

A post on the Pinterest engineering blog explains the evolution of their Zookeeper deployment. It talks about how they use Zookeeper for service discovery, some of the failure scenarios that can occur with Zookeeper, some early attempts they made to mitigate these failures, and the ultimate solution that Pinterest built. The solution uses a separate Zookeeper daemon per server that writes configuration files to the local file system for services to consume. It’s similar to Airbnb’s SmartStack, if you’re familiar with that.
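The local-file pattern described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Pinterest's code: a per-server daemon watches the service's membership in Zookeeper and rewrites a local file that applications read instead of talking to Zookeeper directly.

```python
# Hypothetical sketch of the per-server daemon pattern: render the
# set of live backends for a service to a local config file that
# applications consume. File paths and names are made up.
import json
import os


def render_backends(hosts):
    """Render the list of live backends as the on-disk config payload."""
    return json.dumps({"backends": sorted(hosts)}, indent=2)


def write_config(path, hosts):
    # Write to a temp file and rename, so readers never see a partial file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(render_backends(hosts))
    os.rename(tmp, path)

# With a real Zookeeper ensemble, a kazoo ChildrenWatch on the service's
# path would call write_config on every membership change, e.g.:
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="127.0.0.1:2181"); zk.start()
#   @zk.ChildrenWatch("/services/myservice")
#   def on_change(children):
#       write_config("/etc/myservice/backends.json", children)
```

The atomic rename is the key detail: a service reading the file mid-update gets either the old or the new backend list, never a half-written one.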

The Cloudera blog has a post on Oozie High Availability, which is implemented as an active-active system. For synchronization in the HA system, Oozie uses Zookeeper for distributed locks. It also requires an HA database and a load-balancing strategy for accessing the cluster. The post describes some of the subtler parts of the system, such as log retrieval and security, in more detail.

Another post on the Cloudera blog has an interesting analysis of using solid-state drives (SSDs) for MapReduce. SSDs provided higher sequential and much higher random throughput than hard-disk drives (HDDs). The performance comes at a much higher cost per TB, though, and MapReduce’s sequential I/O achieves maximum throughput from HDDs. The post concludes that the cost of SSDs outweighs the performance gains. This is one of the first analyses of its kind that I’ve seen, and I hope we see more in the future (especially with other applications like HBase and Spark as well as using a larger number of smaller SSDs).
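The trade-off in that analysis can be made concrete with a back-of-the-envelope calculation. The numbers below are hypothetical, not figures from the article, but they show the shape of the argument: the price-per-TB gap is much larger than the sequential-throughput gap that matters most to MapReduce.

```python
# Hypothetical 2014-era numbers, for illustration only.
hdd_cost_per_tb, ssd_cost_per_tb = 50.0, 500.0   # dollars per TB
hdd_seq_mbps, ssd_seq_mbps = 150.0, 450.0        # sequential MB/s per drive

cost_ratio = ssd_cost_per_tb / hdd_cost_per_tb   # SSD costs 10x per TB
speedup = ssd_seq_mbps / hdd_seq_mbps            # ~3x sequential throughput

# For sequential-heavy MapReduce, paying 10x per TB for a ~3x
# sequential throughput gain is a losing trade, which matches the
# post's conclusion. (Random I/O, where SSDs shine, matters less here.)
```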

Performing a major version upgrade of your Hadoop distribution is a harrowing task. Luminar recounts their experiences upgrading from HDP 1 to HDP 2, a summary which includes information that’s relevant regardless of your distribution. For instance, Luminar worked with Hortonworks to build a full script for the upgrade and performed a walkthrough (both practices that I’ve found useful in the past). The post also talks about some things that went wrong.

Hortonworks has reposted part of an analysis by a Hadoop contributor on the trajectory and makeup of the Hadoop source code repository. While the company-centric analyses are always controversial, there are some really interesting takeaways around the new lines of code and changed lines of code (2013 saw significantly more new lines of code vs 2011-12, but fewer changes to existing lines of code). The post also contains some commentary on the role of the Apache Software Foundation in Hadoop development.

It’s looking more and more like 2014 is going to be the year of Apache Spark and other MapReduce successor frameworks. A post on Dice contains a good overview of Spark, including a concise introduction to Spark Streaming. The author gives one of the first reviews I’ve seen from someone using Spark in practice (the author notes that GraphX, which is still in beta, is a bit buggy), albeit on a 6-node cluster.

A Hadoop rack awareness script informs the NameNode to which rack a particular node belongs. The information is used to allocate data blocks across racks to survive a rack failure. While several example scripts can be found online for Linux, building a script for Windows is less common. This post walks through building a script using Windows PowerShell, which is a scripting language built on .NET.
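For readers on Linux, the contract of such a script is simple and worth seeing: Hadoop invokes it with one or more hostnames or IPs as arguments and expects one rack path per argument on stdout. Here's a minimal Python sketch (the subnet-to-rack mapping is made up); the post builds the PowerShell equivalent.

```python
#!/usr/bin/env python
# Minimal rack awareness script: Hadoop passes hostnames/IPs as
# arguments and reads one rack path per argument from stdout.
# The subnet-to-rack mapping below is hypothetical.
import sys

RACKS = {
    "10.0.1.": "/dc1/rack1",
    "10.0.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Map a host/IP to its rack path, falling back to the default rack."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(resolve_rack(host))
```

The script is wired up via the `net.topology.script.file.name` property; nodes that can't be resolved should land in a default rack rather than cause an error.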

The Python Natural Language Toolkit, or NLTK, is a batteries-included natural language processing framework. NLP problems tend to be easy to adapt to MapReduce given that a text corpus can often be split into documents, paragraphs, sentences, etc. This post covers using NLTK and the mrjob Python/Hadoop library to find the most common proper nouns in a dataset (in this case Moby Dick).
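The core map/reduce logic of such a job can be sketched in plain Python. This isn't the post's code: in the real job the mapper and reducer would be MRJob methods and the tagged tokens would come from `nltk.pos_tag(nltk.word_tokenize(line))`; here the tagger is stubbed out with pre-tagged input so the shape of the job is clear.

```python
# Plain-Python sketch of a proper-noun counting MapReduce job.
# In mrjob these would be mapper()/reducer() methods on an MRJob class.
from collections import Counter


def mapper(tagged_tokens):
    """Emit (word, 1) for each proper noun (Penn Treebank tag NNP/NNPS)."""
    for word, tag in tagged_tokens:
        if tag.startswith("NNP"):
            yield word, 1


def reducer(pairs):
    """Sum the counts per word, as the reduce phase would."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


# Stubbed tagger output; the real job would tag each input line with NLTK.
tagged = [("Call", "VB"), ("me", "PRP"), ("Ishmael", "NNP")]
counts = reducer(mapper(tagged))
```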

The sequoia blog has a post about doing Lucene indexing with Hadoop MapReduce. The post, which includes several snippets of example code, describes the problem and elaborates on a custom OutputFormat that the implementation uses to write Lucene indexes to a temporary location on the local file system before copying them to HDFS on task completion.
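The write-locally-then-promote pattern at the heart of that OutputFormat generalizes beyond Lucene, and can be sketched in a few lines (in Python here for brevity; the post's implementation is a Java OutputFormat, and `copy` below stands in for the real copy-to-HDFS step).

```python
# Sketch of the commit pattern: build the index in a local temp
# directory while the task runs, and only on success copy it to the
# final destination, so failed or killed tasks leave no partial index.
import shutil
import tempfile


def build_index(write_documents, final_dir, copy=shutil.copytree):
    """Run write_documents against a local temp dir, then promote it.

    copy is a stand-in for the HDFS upload the real OutputFormat does.
    """
    tmp = tempfile.mkdtemp(prefix="lucene-index-")
    try:
        write_documents(tmp)      # task writes index files locally
        copy(tmp, final_dir)      # "commit": copy to HDFS on success
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
```

If `write_documents` raises, the copy never happens and the temp directory is cleaned up, which is exactly the failure semantics the post's OutputFormat is after.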

Oracle R Advanced Analytics for Hadoop is paid software for running distributed computations with MapReduce from R. A post from the Rittman Mead blog has an overview of this product as well as detailed instructions on setting it up on a CDH4.5 cluster running on RHEL. Oracle provides an evaluation version for developers to test it out.

The SequenceIQ blog has an overview of configuring the YARN capacity scheduler for several queues, and examples showing how to submit jobs to a particular queue. They also have some code snippets showing how to parse data from the YARN scheduler API to inspect the queues and jobs at runtime.
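The runtime-inspection half of that post boils down to parsing JSON from the ResourceManager's `/ws/v1/cluster/scheduler` endpoint. A minimal sketch (not SequenceIQ's code, and with the response heavily abbreviated — real responses carry many more fields per queue):

```python
# Sketch of pulling per-queue capacities out of the YARN scheduler
# REST API response. The sample payload below is abbreviated.
import json


def queue_capacities(scheduler_json):
    """Map each top-level queue name to its configured capacity (%)."""
    info = scheduler_json["scheduler"]["schedulerInfo"]
    return {q["queueName"]: q["capacity"]
            for q in info["queues"]["queue"]}


# In practice this JSON would come from an HTTP GET against
# http://<resourcemanager>:8088/ws/v1/cluster/scheduler
sample = json.loads("""{
  "scheduler": {"schedulerInfo": {"queues": {"queue": [
    {"queueName": "default", "capacity": 50.0},
    {"queueName": "highprio", "capacity": 50.0}
  ]}}}
}""")
```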

News

HBaseCon host Cloudera has announced the keynotes and breakout sessions for the conference, which takes place in May in San Francisco. Keynotes include speakers from Google, Facebook, and Salesforce.com.

Gartner recently released their annual “Magic Quadrant for Data Warehouses” report, and Datanami has a recap of it. For the first time, Gartner has included the offerings of several Hadoop and NoSQL vendors—including Cloudera, MarkLogic, and Amazon Web Services (for RedShift and Elastic MapReduce). Datanami has more details on the report, including some of Gartner’s predictions like “few of the upstart data warehouse vendors will survive past 2016.”

Qubole has compiled a list of Hadoop influencers to follow on Twitter. It’s a great list if you're getting started with Twitter or Hadoop and need a list of folks active in the community to follow for the latest news.

Releases

Version 0.12.0 of the Kite SDK was released. Kite is a library for building Hadoop systems, and the new release includes new MapReduce support and new features in the morphlines library (which is a framework for facilitating ETL).

Apache Tez 0.3 was released. Tez is a framework for doing distributed computation on a data flow graph, a generalization of the MapReduce framework. The new release includes support for secure Hadoop and improved scalability, fault tolerance, and stability. A post on the Hortonworks blog highlights some of the testing they’ve done at scale and the upcoming integration of Tez with Hive, Pig, and Cascading.

Apache Kafka 0.8.1 was released. Kafka is a distributed messaging system that’s often used for data ingestion as part of a Hadoop deployment. Despite the patch-level version increment, the new release includes several new features to make Kafka easier to operate and a new log compaction feature. A write-up by Kafka committer and PMC member Jay Kreps has more details on the release.
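Log compaction is worth a moment's explanation: instead of discarding old log segments purely by age, a compacted topic retains at least the latest value for every key, which makes the log usable as a changelog or snapshot of keyed state. A toy model of that retention rule (an illustration of the semantics, not Kafka's implementation):

```python
# Toy model of Kafka log compaction: keep only each key's most
# recent record, preserving the order of the surviving records.
def compact(log):
    """log is a list of (key, value) records in offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later offsets overwrite earlier
    survivors = sorted(latest.items(), key=lambda kv: kv[1][0])
    return [(key, value) for key, (offset, value) in survivors]


log = [("user1", "a"), ("user2", "b"), ("user1", "c")]
compacted = compact(log)   # user1's older value "a" is dropped
```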

Ferry is a new system that lets you run distributed systems on a single Linux machine using Docker. It should be quite useful for building prototypes by running isolated instances in Linux containers. Ferry currently includes support for Cassandra, Hadoop, and Gluster/OpenMPI.

The open-source Mortar Framework, which is the self-proclaimed “Rails for Pig”, has a new release that substantially eases starting a Pig REPL. By streamlining the Pig install behind the scenes, the framework reduces starting a Pig REPL to a single command once the mortar repo has been cloned.