Data Eng Weekly

Hadoop Weekly Issue #95

09 November 2014

This week’s issue has great technical content, including articles about data infrastructure at companies both small (Buffer and Asana) and large (Facebook, and their big data challenges). There’s also coverage of a diverse set of YARN-related topics: Kafka on YARN, a comparison of YARN and Mesos, and the YARN Timeline Server. In industry news, Databricks’ recent sort benchmark results have earned the company a tie for first place in this year’s Daytona GraySort contest.

Technical

The Buffer developer blog has a post on how they’ve evolved their analytics data infrastructure from just Mongo and Amazon SQS to also include Hadoop and Redshift. They use Mortar’s Hadoop-as-a-Service to run Pig scripts which load data from Mongo to S3 to Redshift. Luigi, the open-source Hadoop workflow engine from Spotify, is used for orchestration.
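The key job a workflow engine like Luigi does here is dependency resolution: each step (extract from Mongo, stage to S3, load to Redshift) runs only after its prerequisites. As a plain-Python sketch of that idea (the task names are hypothetical and this is not Luigi’s API):

```python
# Minimal sketch of dependency-driven task ordering, in the spirit of
# what Luigi does. Task names (extract_mongo, load_s3, load_redshift)
# are illustrative, not from Buffer's actual pipeline.

def topo_order(tasks):
    """Return task names so every task appears after its dependencies."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in tasks[name]:
            visit(dep)          # schedule dependencies first
        order.append(name)

    for name in tasks:
        visit(name)
    return order

pipeline = {
    "extract_mongo": [],
    "load_s3": ["extract_mongo"],
    "load_redshift": ["load_s3"],
}

print(topo_order(pipeline))  # dependencies always come first
```

Luigi adds a lot on top of this sketch (output targets, idempotent re-runs, a scheduler UI), but the ordering logic is the heart of it.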

Facebook recently posted about several data problems that the company is facing. The look at big data challenges gives you a flavor of Facebook’s data sizes/volumes and internal systems (several powered by Hadoop). Among the problems are those faced by many folks working with big data infrastructure (e.g. how to sample data and which types of compression to use) and some that are unique to large-scale companies (e.g. distributing a data warehouse across data centers).

The Cloudera blog has a post on using Spark Streaming for near-real-time session analysis. The post includes an example job which feeds data into HBase to power BI tools via the Hive adapter. The code for this system is available on GitHub, and the post has a detailed look at what the major parts of the example Spark Streaming job are doing.
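Stripped of the Spark machinery, sessionization boils down to grouping a user’s events into a new session whenever the gap between consecutive timestamps exceeds a timeout. A plain-Python sketch of that core logic (the 30-minute timeout is an assumption for illustration, not a detail from the post):

```python
# Sketch of the per-user sessionization logic a Spark Streaming job
# would apply; plain Python here. The 30-minute gap is a hypothetical
# choice, not taken from the Cloudera post.
SESSION_GAP = 30 * 60  # seconds

def sessionize(timestamps, gap=SESSION_GAP):
    """Split event timestamps (seconds) into sessions by idle gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # within the gap: same session
        else:
            sessions.append([ts])     # gap exceeded: start a new one
    return sessions

events = [0, 60, 120, 4000, 4100]
print(sessionize(events))  # the 120 -> 4000 gap exceeds 30 minutes
```

In the streaming version, the interesting part is carrying open sessions across micro-batches (Spark Streaming’s stateful operations), which the post walks through.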

This post looks at the relationship between YARN and Mesos. There’s a fairly direct mapping between major components (e.g. YARN ResourceManager ~ Mesos master with meta-scheduler), but resource allocation is different in the two systems (Mesos is push-based, YARN is pull-based).

Hortonworks has posted a video, slides, and a Q&A from a recent webinar on the new features and improvements in Hive as part of HDP 2.2. The new features in this version (which includes the first set of deliverables from Stinger.next) include support for insert/update/delete and the cost-based optimizer.

This post shows how to deploy the YARN Timeline Server using Apache Ambari blueprints. The Timeline Server is still a work in progress, but the screenshots linked in the post give an idea of what types of information it currently supports.

DataTorrent has blogged about a new project to bring Apache Kafka to YARN. The so-called KOYA (Kafka on YARN) project plans to leverage YARN for Kafka broker management, automatic broker recovery, and more. Planned features include a fully HA ApplicationMaster, sticky allocation of containers (so that a restart can access local data), a web interface for Kafka, and more. The post invites folks in the community to help build KOYA.

O’Reilly Radar has a post on schemas for data. It discusses why it’s tempting to use formats with implicit schemas (e.g. JSON, CSV), the benefits of explicit schemas, and why Apache Avro is a good solution. There’s a bit of detail on Avro and its file format, which stores the schema with the data.
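An Avro schema is itself just JSON, which is part of its appeal. As an illustration of why explicit schemas help (hypothetical field names, and a stdlib-only check rather than the real avro package, which would also embed the schema in the files it writes):

```python
import json

# A hypothetical Avro-style record schema. A real pipeline would hand
# this to the avro package rather than validate by hand.
schema = json.loads("""
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url",     "type": "string"}
  ]
}
""")

def check(record, schema):
    """Stdlib-only sketch: verify required fields and basic types."""
    types = {"long": int, "string": str}
    for field in schema["fields"]:
        value = record.get(field["name"])
        if not isinstance(value, types[field["type"]]):
            return False
    return True

print(check({"user_id": 42, "url": "/home"}, schema))  # True
print(check({"user_id": "oops"}, schema))              # False
```

With an implicit-schema format like CSV, the second record would flow silently downstream; with an explicit schema it fails at the boundary, which is the post’s central point.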

The Cloudera blog has a post on the role of HBase in the Hadoop ecosystem. It discusses when it’s more appropriate to use Cloudera Impala (or any MPP engine atop HDFS) vs. HBase. Oftentimes folks end up duplicating the data between systems, which leads to overhead and questions about the source of truth.

Mortar Data has posted a video (and slides) of a presentation by Mayur Rustagi of Sigmoid Analytics on the Pig-on-Spark initiative. The presentation is from the NYC Pig User Group meetup that took place during Strata + Hadoop World.

Asana has written about the evolution of their data infrastructure and the tools that they’re using. Like Buffer, Asana is loading data into Redshift and is using Luigi for managing dependencies. They are also using Elastic MapReduce. The post walks through their philosophy for building data infrastructure: mainly, don’t over-engineer things from the beginning.

The Cloudera blog has a post about integrating Flume with Kafka. On the Kafka -> Flume side, the integration allows you to deploy Kafka and serialize data to HDFS, HBase, or any other Flume sink without writing any custom code. The integration also supports Flume -> Kafka, in which case a local agent can buffer data. The post also describes upcoming work on a Kafka Channel for Flume.
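For the Kafka -> Flume direction, the appeal is that the wiring is pure configuration. A hedged sketch of what a Flume agent file might look like (property names based on the Flume 1.6-era Kafka source; the agent, topic, and path names are hypothetical):

```properties
# Hypothetical Flume agent: Kafka source -> memory channel -> HDFS sink.
# Names (tier1, events, /flume/kafka) are illustrative only.
tier1.sources  = kafka-source
tier1.channels = mem-channel
tier1.sinks    = hdfs-sink

tier1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.kafka-source.zookeeperConnect = zkhost:2181
tier1.sources.kafka-source.topic = events
tier1.sources.kafka-source.channels = mem-channel

tier1.channels.mem-channel.type = memory

tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.hdfs.path = /flume/kafka/%Y-%m-%d
tier1.sinks.hdfs-sink.channel = mem-channel
```

Swapping the HDFS sink for an HBase sink is the same kind of config change, which is what “without writing any custom code” amounts to in practice.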

Amazon recently announced version 2014.09 of its Linux AMI. While it’s not yet the default AMI for Elastic MapReduce, it offers a lot of compelling features for building a Hadoop (or other big data) cluster in AWS. Those features come via the 3.14.19 Linux kernel, which includes improvements for memory management (zram, zcache, zswap), TCP (fast open enabled by default), and btrfs. This post discusses how those improvements might enhance performance of different systems in the Hadoop ecosystem.

News

GridGain, makers of an in-memory "data fabric," have submitted their code to the Apache Incubator. The new project is known as Apache Ignite (incubating). In the announcement, GridGain touts it as a mature in-memory computing platform that can easily integrate with Hadoop.

Datanami has a report on the state of security for Hadoop. While a number of new projects have cropped up to add authorization, authentication, and encryption to the ecosystem, these are still pretty immature. Commercial add-ons are looking to fill this security gap. Datanami speaks with folks from Dataguise and Zettaset about the state of commercial support.

A trio of LinkedIn veterans who have worked on Apache Kafka and other data infrastructure projects have started a new company called Confluent. They will be focusing on Kafka and real-time data and have publicly committed to continuing to work on Kafka (and potentially other tools, too) in open source. There are more details about the new company in a post on LinkedIn.

Plunger is a new open-source tool from Hotels.com for unit testing Cascading pipelines. The GitHub project README has several code examples of the API. The framework provides a number of utilities for testing (such as pretty-printing data and testing serializers).