Data Eng Weekly

Hadoop Weekly Issue #113

22 March 2015

This issue has more variety than we've seen in recent months. There are great technical articles covering everything from tuning AWS for Hadoop to Apache Flink to Hadoop with Python to Apache Tajo. In news, Tachyon Nexus announced a series A round. And in releases, two exciting new projects provide the ability to run HDFS on Mesos and to stream MySQL replication events to Kafka.

Technical

The Confluent blog has a post that provides suggestions for choosing the number of partitions in a Kafka topic. While more partitions help improve throughput, increasing the number results in more open file handles, (potentially) longer unavailability in certain failure scenarios, higher end-to-end latency, and additional memory requirements in clients. The post describes each of these trade-offs in depth.
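To make the parallelism side of that trade-off concrete, here is a minimal sketch of keyed partition assignment. Note that Kafka's default partitioner actually uses a murmur2 hash, not MD5, so the hash below is only illustrative:

```python
import hashlib

def assign_partition(key: bytes, num_partitions: int) -> int:
    """Illustrative stand-in for Kafka's default partitioner:
    a stable hash of the message key, modulo the partition count."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All messages with the same key land on the same partition, so the
# partition count caps consumer parallelism -- but each partition also
# costs the brokers open file handles and the clients buffer memory.
for key in [b"user-1", b"user-2", b"user-3"]:
    print(key, "->", assign_partition(key, 12))
```

Because keys pin messages to partitions this way, raising the partition count later also changes where existing keys land, which is one more reason to choose the number carefully up front.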

This presentation provides an up-to-date overview of the state of Hadoop with Python. It looks at several open-source frameworks, including mrjob and Pydoop for MapReduce jobs, snakebite for interacting with HDFS, and the Python APIs included with Spark and Pig.
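All of the listed frameworks expose some flavor of the same map/shuffle/reduce model; as a frame of reference, here is a pure-Python sketch of that model (deliberately tied to none of the libraries above), using the canonical word-count example:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run a single MapReduce-style pass in-process.

    mapper(record) yields (key, value) pairs; reducer(key, values)
    returns the output for a key -- mirroring the two functions an
    mrjob or Pydoop job defines.
    """
    groups = defaultdict(list)  # the "shuffle" phase: group values by key
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    return sum(counts)

print(map_reduce(["Hadoop with Python", "python on Hadoop"], mapper, reducer))
```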

The Qubole blog has a post looking at the effects of different types and features of virtualization on the Amazon Web Services cloud. The post is worth reading in its entirety, but the key takeaways are that switching from PV to HVM instances and enabling enhanced networking are major wins. They didn't see huge improvements with placement groups. As always, it's worth validating these results with your own application.

This is a good read about how Apache Flink solves a number of classic distributed-systems problems. Focusing on equi-joins, the post describes the high-level Flink API, join strategies, memory management, join optimization, and performance.
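Stripped of Flink's managed memory and spilling, the core of a hash-based equi-join (one of the strategies the post discusses) can be sketched in a few lines: build an in-memory index on one input, then stream the other input past it.

```python
from collections import defaultdict

def hash_equi_join(build_side, probe_side, build_key, probe_key):
    """Hash join: index the (ideally smaller) build side by key,
    then stream the probe side and emit matching row pairs."""
    index = defaultdict(list)
    for row in build_side:
        index[build_key(row)].append(row)
    for row in probe_side:
        for match in index.get(probe_key(row), []):
            yield match, row

users = [(1, "alice"), (2, "bob")]
orders = [(101, 1), (102, 1), (103, 3)]
pairs = list(hash_equi_join(users, orders, lambda u: u[0], lambda o: o[1]))
# pairs -> [((1, 'alice'), (101, 1)), ((1, 'alice'), (102, 1))]
```

What makes a real implementation hard is everything this sketch omits: the build side may not fit in memory, which is where Flink's memory management and strategy selection come in.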

This post on the Cloudera blog describes the Spark-Kafka integration in the recent 1.3 release of Spark. Topics include creating RDDs for batch jobs, RDDs for streaming, and an overview of strategies for building at least once/at most once/exactly once delivery of results. The exactly-once section describes two strategies—idempotent writes based on unique keys and transactional writes.
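The idempotent-write strategy can be illustrated without Spark or Kafka at all: key each write by something unique to the message (for instance its topic, partition, and offset) so that a replay overwrites rather than duplicates. A toy version against a dict-backed store:

```python
def idempotent_write(store, topic, partition, offset, value):
    """Writes keyed by (topic, partition, offset) are safe to replay:
    reprocessing the same message overwrites the same entry instead of
    appending a duplicate, giving effectively-once results."""
    store[(topic, partition, offset)] = value

store = {}
for _ in range(2):  # simulate a failure-driven replay of the same batch
    idempotent_write(store, "events", 0, 42, {"user": 1, "action": "click"})
assert len(store) == 1  # the replay did not duplicate the record
```

Transactional writes take the other route: commit the output and the consumed offsets atomically, so a failed batch leaves no partial results behind.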

A new paper on Spark analyzed performance on the BDBench and TPC-DS benchmarks and found some surprising results. Specifically, they found that CPU is often the limiting factor and not disk or network I/O. It's a big paper with a lot of interesting findings and suggestions for improvement.

The Hortonworks blog has a post on several new features that have been added to the Hadoop ecosystem in order to support rolling upgrades. It discusses some operational items like software packaging and configuration as well as the changes in core HDFS, YARN, Hive, and more. There are also instructions for the order in which to upgrade services as part of a full upgrade.

This post starts out with a story that's all too familiar for many people working with Hadoop—you have a seemingly simple query, but you spend a lot of time finding the right data to query. One solution to this problem is to keep every dataset in Hive and to use comments to describe the dataset. Then, Apache Falcon provides a nice interface to view and search datasets in Hive (in addition to several other features, which the article describes).

Hortonworks has a recap of talks at the recent Apache Slider meetup. There was a talk on running dockerized applications on YARN and another on KOYA (Kafka on YARN). The post also has links to the presenter slides.

While MongoDB has a built-in MapReduce framework, there are often advantages to processing data outside of Mongo. To that end, this post gives an introduction to integrating MongoDB with Spark using the Hadoop input format for Mongo.

The LinkedIn Site Reliability team has pulled back the curtain to reveal a lot about how LinkedIn uses Apache Kafka. Topics covered include scale (175 terabytes/day), the types of applications (queueing, logging, metrics, and more), their multi-datacenter setup, and integration into the application stack.

Apache Tajo version 0.10 was released last week, and this tutorial provides all the instructions needed to get started with Tajo on an Amazon Elastic MapReduce cluster. The walkthrough installs Tajo via a bootstrap action and stores data in HDFS; if you want to integrate directly with S3, the post describes the additional configuration required to do so.

MapR has posted a new whiteboard walkthrough, which compares and contrasts Hadoop with NoSQL systems. In addition to a short video, the transcript of the presentation is available on the MapR blog. It covers the strengths of Hadoop vs. NoSQL and when each one is appropriate.

News

Tachyon Nexus is a new company from the folks at UC Berkeley's AMPLab behind Tachyon, the memory-centric distributed storage system. This week, they announced a Series A round of $7.5 million, led by Andreessen Horowitz.

InfoWorld has an article that recounts some of the themes of Cloudera's analyst day, which took place earlier this week. These include Cloudera's goal of being "the big data company," revenue (and how it relates to customers using Cloudera's free software), and competition with Hortonworks and the Open Data Platform.

A post on the Enterprise Software Musing blog also reports on Cloudera's analyst day. This post is more focused on the specifics of Cloudera's business—the scale of their traction (adding on average two new employees and two new partners each day), their plans to expand into new verticals like financial services and telcos, and the importance of partners.

Hadoop-as-a-Service vendor Qubole has announced a new connector between their platform and Amazon Redshift. The integration provides the ability to save the output of queries run in Spark and Hive to a table in Redshift.

Mesosphere has announced a new open-source project to run HDFS on Mesos. When running HDFS via the system, all DataNodes, NameNodes, and Quorum JournalNodes are launched automatically. Enabling "Super High Availability" allows the system to automatically re-provision NameNodes.

The sqlstream project provides an integration between MySQL replication and Apache Kafka. Replication events are translated to JSON and sent to a Kafka topic. The README shows some examples of the types of events one can expect.
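As a rough illustration only (the field names here are hypothetical, not sqlstream's actual schema; see its README for real examples), a row-change event serialized for a Kafka topic might look like:

```python
import json

# Hypothetical shape of a MySQL row-change event; the real project
# defines its own schema for inserts, updates, and deletes.
event = {
    "database": "shop",
    "table": "orders",
    "type": "insert",
    "row": {"id": 7, "total": 19.99},
}

# The JSON bytes are what would be produced to the Kafka topic.
payload = json.dumps(event, sort_keys=True).encode("utf-8")
print(payload)
```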