Data Eng Weekly

Hadoop Weekly Issue #87

14 September 2014

There were several releases in the Hadoop ecosystem this week, including Apache Hadoop 2.5.1 and Apache Spark 1.1.0. There’s a lot of interesting technical content, including testing HBase’s consistency with Jepsen and an in-depth look at an end-to-end big data infrastructure with Hadoop. On that note, there’s an interesting look into the growing demand for Data Engineers to build out Hadoop infrastructure.

Technical

A post on The AWS Big Data Blog covers custom configuration of Elastic MapReduce (EMR) clusters using bootstrap actions. One such bootstrap action, which is presented as an example in the post, installs Presto (the SQL-on-Hadoop system open-sourced by Facebook).
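To make the mechanism concrete, a bootstrap action is essentially a script in S3 plus optional arguments, run on each node at cluster startup. Below is a minimal sketch of how such an action could be described from Python; the bucket path, script name, and arguments are hypothetical, and the boto3 call shown in the comment is just one way it could be submitted:

```python
# Sketch: describing an EMR bootstrap action as data. The S3 path and
# arguments below are hypothetical placeholders, not from the AWS post.
bootstrap_actions = [
    {
        "Name": "install-presto",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/bootstrap/install-presto.sh",  # hypothetical
            "Args": ["--coordinator-memory", "4G"],                # hypothetical
        },
    }
]

# With a modern boto3 client, this list would be passed when launching
# the cluster, e.g.:
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(Name="presto-cluster", BootstrapActions=bootstrap_actions, ...)
print(bootstrap_actions[0]["Name"])
```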

The latest post in a series on frameworks for big data analytics looks at Shark, Hive-on-Spark, and Spark SQL. The post describes the design/architecture of Shark and Spark SQL in detail. Spark SQL has the interesting quality of enabling SQL queries over data that Hive doesn’t know about, such as a local JSON file.
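To illustrate the idea of running SQL directly over raw records a warehouse has never registered (without requiring a Spark installation), here is a self-contained toy stand-in using Python's built-in sqlite3; the records and schema are invented for the example:

```python
import json
import sqlite3

# Toy stand-in for "SQL over a local JSON file": parse JSON lines into an
# in-memory table, then query them with plain SQL. Spark SQL does the
# schema inference and distribution for you; this just shows the idea.
json_lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 28}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
for line in json_lines:
    row = json.loads(line)
    conn.execute("INSERT INTO people VALUES (?, ?)", (row["name"], row["age"]))

adults = conn.execute("SELECT name FROM people WHERE age > 30").fetchall()
print(adults)  # [('alice',)]
```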

The Hortonworks blog has another set of curated Hadoop Summit content, this time focusing on Apache Hive. They highlight slides and video from seven presentations, which cover ACID for Hive, Hive and Tez, the cost-based query optimizer for Hive, and more.

This is a fantastic article about all the data plumbing/infrastructure that’s required to build a production big data system. There are several parts, each covered in depth—cluster planning (which includes reference architectures), data ingestion (batch ingest, event ingest, storage formats, data partitioning, access control), data processing (data transformation, analytics), egress/querying, and productionization. This is one of the best and most complete guides to what a big data platform with Hadoop should strive towards in order to be successful.

Apache Kafka is gaining popularity as a tool for data ingestion into Hadoop clusters. Unlike other systems, such as Apache Flume or Scribe, Kafka is a pull-based system, which allows for multiple consumers/destinations of data (rather than just HDFS or HBase). This article introduces Kafka and includes an example use case: using Kafka to flag transactions in a massively multiplayer online game. There’s also an in-depth comparison of Kafka and Flume, which explores the advantages of and trade-offs between the two systems.
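The pull model described above can be sketched in a few lines: each consumer keeps its own offset into a shared, append-only log, so adding a new destination never disturbs the others. This is a toy in-memory illustration, not real Kafka:

```python
# Toy illustration of Kafka's pull model: a shared append-only log with
# per-consumer offsets. Each consumer reads at its own pace, which is why
# multiple destinations (HDFS, HBase, ...) can consume the same stream.
class Log:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0  # per-consumer position: the key to the pull model

    def poll(self):
        batch = self.log.messages[self.offset:]
        self.offset = len(self.log.messages)
        return batch

log = Log()
hdfs_sink, hbase_sink = Consumer(log), Consumer(log)

log.append("txn-1")
first = hdfs_sink.poll()    # ['txn-1']
log.append("txn-2")
second = hdfs_sink.poll()   # ['txn-2']
other = hbase_sink.poll()   # ['txn-1', 'txn-2'] -- unaffected by the other sink
print(first, second, other)
```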

The MapR blog has a two-part series on OpenTSDB. The first part introduces the notion of time series data and OpenTSDB’s data model/API. The second article covers backfilling a massive amount of time series data into OpenTSDB. For this, they used MapR-DB (which is compatible with the HBase API) and a modified OpenTSDB that supports bulk importing (the code for these changes is available on GitHub). With these changes, they can load about 110 million points/sec.
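OpenTSDB’s data model boils down to one numeric value per (metric, timestamp, tag set). A minimal sketch of a single data point, shaped like the JSON accepted by OpenTSDB 2.x’s HTTP /api/put endpoint (the metric, tag names, and values here are invented for illustration):

```python
import json

# One OpenTSDB-style data point: a metric name, a Unix timestamp, a
# numeric value, and a set of identifying tags. The tag set is what
# distinguishes one time series from another under the same metric.
point = {
    "metric": "sys.cpu.user",                    # hypothetical metric
    "timestamp": 1410700000,
    "value": 42.5,
    "tags": {"host": "web01", "dc": "us-east"},  # hypothetical tags
}

payload = json.dumps(point)
# An HTTP POST of `payload` to a TSD's /api/put endpoint would store it;
# the MapR bulk-import path in the article bypasses this endpoint for speed.
print(payload)
```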

This post covers the coding standards for Apache Hadoop. It discusses much more than code style, covering best practices in detail for everything from concurrency to logging. If you’re planning to submit code to the Hadoop codebase, it’ll be useful to get familiar with these (formerly unwritten) policies and rules.

Hue, the web front-end for Hadoop clusters, is a hybrid Python/Java application. Given that those technologies can pose some setup challenges, this article has a walkthrough of building a Hue development environment on Ubuntu 14.04.

StackIQ makes tools for managing HPC, cloud, and big data clusters. The StackIQ Cluster Manager integrates with Apache Ambari (using the REST API) for provisioning or adding nodes to a Hadoop cluster. This post walks through the manager’s CLI, rocks, and shows how to use it to do several administrative tasks.

This post explores the internals of the YARN Fair Scheduler. Throughout, it describes how the Fair Scheduler differs from the Capacity Scheduler in both features and implementation. The bulk of the post describes what happens during the scheduler event loop (for events such as NODE_ADDED or APP_ADDED).
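The core idea behind fair scheduling can be sketched in a few lines: divide cluster capacity among queues in proportion to their weights. This is a deliberately simplified illustration with invented queue names; the real Fair Scheduler also handles minimum shares, queue hierarchies, and preemption:

```python
# Toy fair-share computation: split total capacity among queues in
# proportion to their weights. The actual YARN Fair Scheduler layers
# min shares, hierarchy, and preemption on top of this basic idea.
def fair_shares(capacity, weights):
    total = sum(weights.values())
    return {queue: capacity * w / total for queue, w in weights.items()}

# Hypothetical queues: "prod" is weighted twice as heavily as the others.
shares = fair_shares(100, {"etl": 1, "adhoc": 1, "prod": 2})
print(shares)  # {'etl': 25.0, 'adhoc': 25.0, 'prod': 50.0}
```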

The folks at SequenceIQ are at it again with integrating parts of the Hadoop ecosystem with Docker. This time, they’ve announced a preliminary Docker image for Apache Drill that allows you to query data in a directory shared from the host machine. The post introduces the key parts of Apache Drill, explains how it’s been integrated with Docker, and provides some examples of how to use Drill in Docker.

Jepsen is a tool to test distributed databases by simulating network partitions and quantifying the database’s consistency and availability. The Call Me Maybe series by @aphyr has looked at a number of databases using this tool, and the Yammer blog looks at a new one—Apache HBase. HBase strives to be consistent in the case of a network partition (at the cost of availability), and the results of the Jepsen testing agree with that (be sure to check out the addendum for some clarifications of the results).

This post has several details on comScore’s big data infrastructure. They ingest terabytes of data each day into a 400-node MapR cluster. The post describes some of the other tools that comScore uses, such as SyncSort’s DMX to sort data as it is being loaded (which helps compress the data much more efficiently). In addition to the SyncSort tools, comScore has a 200-node EMC Greenplum cluster.

This post introduces Accumulo’s server-side programming hooks: Filters, Combiners, and Iterators. While Filters and Combiners are quite simple, one must work with Iterators to do more complex operations (such as consuming one type of data but producing another). The post walks through code snippets of a few Iterators, and the full source is available on GitHub.
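To make the Combiner idea concrete: values sharing a key are reduced server-side as they stream past at scan or compaction time, rather than being rewritten eagerly. Real Accumulo Combiners are Java classes extending the Combiner iterator; this is just a self-contained Python sketch of the reduction:

```python
from itertools import groupby

# Toy sketch of Accumulo's Combiner concept: reduce all values that share
# a key into one value, as the key-sorted stream of entries flows by.
def summing_combiner(entries):
    """entries: (key, value) pairs sorted by key, as an iterator sees them."""
    for key, group in groupby(entries, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)

# Hypothetical entries: two increments for row1, one for row2.
entries = [("row1:count", 3), ("row1:count", 4), ("row2:count", 1)]
combined = list(summing_combiner(entries))
print(combined)  # [('row1:count', 7), ('row2:count', 1)]
```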

News

Two chapters of an upcoming book on Hadoop Security are available in the early preview program from O’Reilly. If you’re thinking of preordering, the Cloudera blog has details on the goals of and planned content for the complete book.

Linux.com has the story of Hadoop’s move from SVN to Git. The article includes interviews with several folks from the ASF in which they discuss the motivation for switching (tooling, easier feature branches, easier sharing of code) as well as some of the trade-offs (namely the lack of fine-grained access controls). The article also details the set of steps taken to do the SVN to Git migration.

At the Intel Developer Forum this week, Intel and Cloudera spoke about technical collaboration coming out of their business partnership. Specifically, Cloudera’s distribution has been optimized for the new Intel Xeon E5 v3 processor, which Intel says more than doubles performance when running Cloudera software. Intel also said that they expect Hadoop to be the top application on data center servers within the next couple of years.

TechRepublic has an interview with Peter Cnudde, VP of Engineering at Yahoo, on Hadoop at Yahoo. They talk about the massive scale of Yahoo’s Hadoop and YARN deployment, some of the interesting challenges and opportunities this presents, the advantages of Hadoop for enterprises and non-web companies, and how Hadoop (and its ecosystem) fits together with non-Hadoop enterprise data warehouse systems.

Datanami has an article on the growing demand for data engineers—the type of engineer that works with Hadoop to build out core infrastructure like data ingestion and data quality. The article notes that data engineers often work in conjunction with data scientists and that data engineers are quite difficult to find.

Apache Spark 1.1.0 was released. The new version includes improvements to MLlib, Spark SQL, PySpark, and Spark Streaming. It also includes memory-management improvements, several improvements for monitoring Spark jobs, and improved integration with Apache Flume. Both the Spark website and the Databricks blog have more details on the new features.

Apache Cassandra 2.1 was released. A post on the Apache blog touts aspects of the new release, including performance improvements (over 50% better), production support for Windows, and the CQL3 tuple and user-defined type (UDT).

The Metanautix Quest Data Compute Engine is the latest entry into the SQL-on-Hadoop space. Their offering is commercial and aims to support a wide array of data sources, from data in an Oracle database to data in Amazon S3. More details about the product are in an introductory blog post.