Data Eng Weekly

Hadoop Weekly Issue #233

17 September 2017

Lots of great technical content this week, including posts on Kafka, SparkR, and Amazon EMR. And if you're looking for more, the Kafka Summit videos and slides are online. In releases, Kudu, Impala, Kafka, and Storm all have new versions out this week.

Technical

While most of the articles I highlight target distributed systems, this one covers some Python tools for training a model and serving results. It focuses on the data engineering aspects of that work, from getting started with a simple example to scaling up in a production environment.

Confluent has written about how Apache Kafka is tested. They highlight that before any code is written, the design is "tested" through the open Kafka Improvement Proposal (KIP) process. From there, code undergoes unit tests, integration tests in a single process, and system tests that cover performance and correctness across multiple instances, with injected faults and more.

This post provides an interesting overview of the types of IO supported by the Linux kernel (vectored IO, memory mapping, and async IO) and the trade-offs between them. While this isn't necessarily something you'll need every day, many of the data systems covered in this newsletter make use of advanced features such as mmap and fadvise.
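To make two of those IO styles concrete, here is a minimal sketch using only the Python standard library on a Unix-like system. It is not taken from the linked post: the function name `vectored_write_then_mmap_read` and the sample buffers are hypothetical, and the `posix_fadvise` hint is guarded because not every platform exposes it.

```python
import mmap
import os
import tempfile


def vectored_write_then_mmap_read(chunks):
    """Write several buffers with one writev call, then read them back via mmap.

    Hypothetical helper for illustration only; assumes a non-empty list of
    small byte buffers so that a single writev call writes everything.
    """
    fd, path = tempfile.mkstemp()
    try:
        # Vectored IO: one system call writes all buffers, in order.
        written = os.writev(fd, chunks)

        # Advisory hint that the file will be read sequentially.
        # The kernel is free to ignore it.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

        # Memory mapping: expose the file contents as a bytes-like view
        # (length 0 means "map the whole file").
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
            data = view[:]
        return written, data
    finally:
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    count, contents = vectored_write_then_mmap_read(
        [b"header,", b"payload,", b"footer"]
    )
    print(count, contents)
```

The trade-off the sketch hints at: writev saves system calls when data already lives in separate buffers, while mmap trades explicit read calls for page faults handled by the kernel.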

Hortonworks has done a performance analysis of Apache HBase and Apache Cassandra using the Yahoo Cloud Serving Benchmark. For the testing, the services were configured to read and write data on AWS' attached SSD storage. Unsurprisingly, HBase was faster for reads and Cassandra performed better when workloads were write-heavy.

Confluent co-founder and Kafka co-creator Jay Kreps writes about several use cases in which Apache Kafka is a good choice for your data's source of truth. These include a centralized log of changes, powering in-memory caches for online systems, kappa architecture use cases, and change data capture. Jay argues that Kafka is the commit log for the datacenter, but at the same time it won't replace traditional databases, since Kafka doesn't plan to support arbitrary queries.
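The "log of changes powering an in-memory cache" idea can be sketched without any Kafka client at all. In the illustration below, a plain Python list stands in for a topic, and the names `ChangeEvent` and `rebuild_cache` are hypothetical. Treating a `None` value as a delete is an assumption of this sketch, loosely mirroring the null-value tombstones used by compacted Kafka topics.

```python
from typing import Dict, Iterable, Optional, Tuple

# (key, new value); a value of None marks a delete. This convention is an
# assumption of the sketch, not a Kafka client API.
ChangeEvent = Tuple[str, Optional[str]]


def rebuild_cache(log: Iterable[ChangeEvent]) -> Dict[str, str]:
    """Replay an ordered change log from the start to rebuild current state."""
    cache: Dict[str, str] = {}
    for key, value in log:
        if value is None:
            # Delete marker: drop the key if it is present.
            cache.pop(key, None)
        else:
            # Later events for the same key overwrite earlier ones.
            cache[key] = value
    return cache


if __name__ == "__main__":
    changelog = [
        ("user:1", "alice"),
        ("user:2", "bob"),
        ("user:1", "alicia"),  # update
        ("user:2", None),      # delete
    ]
    print(rebuild_cache(changelog))
```

Because the log is the source of truth, the cache is disposable: any consumer can rebuild the same state by replaying the events in order. Point lookups work well this way, but arbitrary queries still call for a database.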

Releases

Qubole has announced general availability of their AIR (Alerts, Insights, Recommendations) service. AIR offers usage-based ranking and context-aware suggestions when writing a query, enables search across column and table names, and provides usage reports, statistics, data previews, and actionable recommendations for improving data models.

Version 1.5.0 of Apache Kudu has been released. While it's a minor release, there are new features like the ability to tolerate disk failures at startup and improvements to client tools (e.g. exporting CSV files and a tablet move operation). There are also a number of optimizations and bug fixes. The release notes contain some details for anyone considering the upgrade.