Data Eng Weekly

Hadoop Weekly Issue #123

31 May 2015

As has been the recent trend, a number of posts in this week's issue are only tangentially related to Hadoop. I've included them in the hopes that they're useful for folks working with distributed systems (whether built atop Hadoop or not). For instance, there's a fantastic article on using logs for data integration, a post on Mesos/Omega/Borg, and a post on the consequences of disk wiping in distributed consensus.

Technical

Pipeline scans are a new feature targeted for a future release of Apache HBase. For scans over large numbers of records, the client will prefetch additional rows as the current batch is being processed. The Yahoo Hadoop blog has more details on the implementation and provides some experimental results in which the feature improves throughput by nearly 3x.
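
The prefetching idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the HBase client implementation: `fetch_batch`, `pipelined_scan`, and `prefetch_depth` are hypothetical names, and the "rows" are plain integers standing in for scan results.

```python
import queue
import threading


def fetch_batch(start, size, total):
    """Hypothetical stand-in for one scan RPC: returns rows [start, start+size)."""
    return list(range(start, min(start + size, total)))


def pipelined_scan(total_rows, batch_size, prefetch_depth=1):
    """Yield rows while a background thread fetches upcoming batches."""
    # Bounded queue: at most `prefetch_depth` fetched batches wait in the queue.
    batches = queue.Queue(maxsize=prefetch_depth)
    done = object()  # sentinel marking the end of the scan

    def fetcher():
        start = 0
        while start < total_rows:
            batches.put(fetch_batch(start, batch_size, total_rows))
            start += batch_size
        batches.put(done)

    threading.Thread(target=fetcher, daemon=True).start()

    while True:
        batch = batches.get()
        if batch is done:
            return
        # While the caller processes these rows, the fetcher is already
        # retrieving the next batch.
        yield from batch


rows = list(pipelined_scan(total_rows=10, batch_size=4))
```

The bounded queue keeps the fetcher from running arbitrarily far ahead of a slow consumer, which caps the extra memory the client holds.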

IBM BigInsights 4.0, which was released in late March, supports SQL querying of data in HBase. It includes a number of important features like windowing and OLAP aggregate functions, nested sub-queries, predicate pushdown, and secondary indexes. The IBM Hadoop Dev blog has many more details on the features of SQL-on-HBase in BigInsights.

The morning paper covered some publications relevant to folks working with Hadoop. First is a paper from Google, "Pregel: A System for Large-Scale Graph Processing," which was the inspiration for Apache Giraph. Second is GraphLab, which is a framework for parallelizing "...asynchronous iterative algorithms with sparse computational dependencies..." Third, Distributed GraphLab describes how to evolve the GraphLab abstraction to a distributed setting.
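
For readers new to the vertex-centric model, here is a minimal single-process sketch of Pregel-style supersteps, using maximum-value propagation as the example computation. It is a toy under stated assumptions: `pregel_max` and its arguments are hypothetical names, and nothing here reflects the Giraph or GraphLab APIs.

```python
def pregel_max(values, edges, max_supersteps=30):
    """Propagate the maximum value through a graph, vertex-centric style.

    values: dict vertex -> initial value
    edges: dict vertex -> list of out-neighbours
    """
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)

    for _ in range(max_supersteps):
        # A vertex with no incoming messages stays halted; the computation
        # ends when no messages are in flight.
        if not any(inbox.values()):
            break
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        inbox = outbox  # barrier: messages are delivered in the next superstep
    return values


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
```

Each vertex only reads its own messages and writes to its own neighbours, which is what lets a real system partition the vertices across machines.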

The Confluent blog has a transcript and the video of a recent talk by Martin Kleppmann entitled "Using Logs to Build a Solid Data Infrastructure (or: Why Dual Writes Are a Bad Idea)." The post describes the challenges involved in data integration, how an append-only log can be used to solve these, how logs are used in db storage engines, db replication, distributed consensus, and Apache Kafka, and how to build data integration using a distributed log.
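
The core argument is easy to sketch: if every write goes to a single append-only log, each downstream store applies the same entries in the same order, so the stores cannot disagree the way they can with dual writes. The classes below are hypothetical and in-memory; this is not the Kafka API.

```python
class Log:
    """An in-memory append-only log; entries get sequential offsets."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self._entries[offset:]


class Consumer:
    """Applies log entries, in order, to a derived store (hypothetical)."""

    def __init__(self, apply):
        self.offset = 0
        self.apply = apply

    def poll(self, log):
        for entry in log.read_from(self.offset):
            self.apply(entry)
            self.offset += 1


log = Log()
database, cache = {}, {}
db_consumer = Consumer(lambda e: database.__setitem__(e[0], e[1]))
cache_consumer = Consumer(lambda e: cache.__setitem__(e[0], e[1]))

# Two writers update the same key; the log fixes a single order for both.
log.append(("x", 1))
log.append(("x", 2))

db_consumer.poll(log)
cache_consumer.poll(log)  # may run later; it still sees the same order
```

Because each consumer tracks its own offset, a new store can be added later and caught up by reading the log from offset 0.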

This post provides a great overview and summary of the Mesos, Omega, and Borg papers. It provides some background by contrasting the problems of heterogeneous datacenters with those of homogeneous HPC clusters. Next, it describes Mesos' two-level scheduling, Omega's optimistic scheduling, and Borg (which is the production datacenter scheduler at Google). There are a number of interesting details from the Borg paper mentioned, such as the median cluster size of 10K nodes and the distinction between priorities when scheduling services and batch jobs.
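
As a rough illustration of two-level scheduling, the sketch below has a "master" offer each node's free CPUs to frameworks in turn, while each framework decides for itself which of its tasks to place. It is heavily simplified: the function name, the fixed offer order, and the CPU-only resource model are assumptions made for the example, not how Mesos actually allocates.

```python
def two_level_schedule(free_cpus, frameworks):
    """Offer each node's free CPUs to frameworks; frameworks decide placement.

    free_cpus: dict node -> free CPU count
    frameworks: list of (name, pending_tasks), each task a CPU requirement
    Returns a list of (framework, node, cpus) placements.
    """
    free = dict(free_cpus)
    placements = []
    for name, pending in frameworks:
        pending = list(pending)
        for node in sorted(free):
            offer = free[node]  # level 1: the master makes a resource offer
            # Level 2: the framework's own scheduler picks tasks that fit
            # and implicitly declines the rest of the offer.
            for task in list(pending):
                if task <= offer:
                    placements.append((name, node, task))
                    pending.remove(task)
                    offer -= task
            free[node] = offer
    return placements


placements = two_level_schedule(
    {"node1": 4, "node2": 2},
    [("batch", [3, 2]), ("service", [1, 1])],
)
```

In this run the second "service" task stays unplaced because no remaining offer is large enough, which hints at why offer order and fairness policy matter in a real scheduler.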

Distributed consensus implementations often use a disk for persisting data. With that in mind, it's still a bit surprising that wiping a disk can lead to data loss in a system like ZooKeeper. This post describes the details of the problem (using ZooKeeper as a reference), and it explores several solutions (e.g. using super-majorities and db tokens).
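
A toy model shows why a wiped disk is dangerous under simple majority quorums. This is not ZooKeeper's protocol, just the counting argument with three hypothetical nodes: a write acknowledged by a majority can vanish if one of the acknowledging nodes rejoins with an empty disk.

```python
def majority(n):
    return n // 2 + 1


nodes = {"A": set(), "B": set(), "C": set()}  # each node's durable log

# A write is acknowledged once a majority (2 of 3) has persisted it.
for name in ("A", "B"):
    nodes[name].add("w1")
acked = sum("w1" in log for log in nodes.values()) >= majority(len(nodes))

# B's disk is wiped and B rejoins with an empty log.
nodes["B"] = set()

# A becomes unreachable; B and C still form a majority, but neither holds w1.
quorum = ["B", "C"]
quorum_ok = len(quorum) >= majority(len(nodes))
visible = set().union(*(nodes[n] for n in quorum))
```

Here `acked` and `quorum_ok` are both true while `visible` no longer contains `w1`: the usual guarantee that any two majorities overlap in a node holding the write assumes that node's disk survived.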

This post describes how a data infrastructure based around Apache Kafka can be used to populate multiple data stores (e.g. a new or prototype store) in parallel. This strategy enables much more informed decisions than an all-at-once switchover.

This post describes how to install and configure Apache Sentry and Sqoop2 such that Sqoop2 uses Sentry's authorization. It also gives some examples of creating users/roles and verifying that the permissions work as intended.

When getting started with Oozie, it can be confusing to understand what is happening as you submit jobs via the Oozie client. This post describes the process in detail and covers some common pitfalls. Specifically, it looks at common issues and workarounds related to the Oozie "launcher job" failing: running out of memory, deadlocking a cluster, and configuring Hive's scratchdir.

The Databricks blog has a guest post on tuning garbage collection for Apache Spark. The post is full of details, including a description of how Java GCs work, an overview of Spark's memory management, notable JVM arguments related to GC, tips for analyzing GC performance, and more.

This walkthrough (both in video and transcript form) describes the architecture of Apache Drill. It covers things like when Drill takes advantage of data locality, the components of the Drill cluster (it's a homogeneous architecture), and connecting to Drill via ODBC/JDBC and REST.

This presentation contains practical advice and information related to Apache Flink (there have been some good introductory posts/presentations in previous issues of Hadoop Weekly). It covers things like running a Flink cluster, unit tests for Flink, debugging Flink (including remote debugging), job tuning, and much more.

I really enjoy reading about folks' practical experiences with Hadoop (whether good or bad). This post describes what "bad things will happen" when filling the datanode disks (in this case during a distro upgrade). It details the symptoms, including snippets from the logs, and suggests a few setting changes to mitigate the problem.

News

Hortonworks has announced a new program for academic institutions to train students. Universities that are Hortonworks Academic Partners get access to Hortonworks course materials for HDP Operations, HDP Developer, and/or HDP Data Science.

Releases

Apache HBase has published "CVE-2015-1836: Apache HBase remote denial of service, information integrity, and information disclosure vulnerability". There are hotfix upgrades of the 0.98, 1.0.1, and 1.1.0 releases, and the following post describes the mitigation steps.