Data Eng Weekly

Hadoop Weekly Issue #210

26 March 2017

Lots and lots of open-source releases this week: Apache NiFi, Apache Knox, Apache Kudu, Apache Flink, and more (including a new open-source time series database). There are also some great technical posts on HDFS erasure coding, Apache Phoenix, and Amazon Athena/Presto.

Technical

Sendence has written about Wallaroo, their distributed event processing framework. The team plans to open-source it soon, but in the meantime this post describes what it is, the core abstractions, key features (like exactly-once processing), and future plans. Impressively, Wallaroo has median processing latencies in the microseconds and 99.99th-percentile latencies around 1ms (their example use case is a trading system). Currently, APIs are available in C++ and Pony, but support is planned for other languages too.

Hortonworks has the fourth part in their "Data Lake 3.0" series. This part describes the evolution of HDFS storage, specifically the heterogeneous storage system introduced in Hadoop 2.3 and the erasure coding implementation that is underway now. The post has a good description of how erasure coding is implemented, and it describes the main practical challenges (like small files and write-throughput overhead).
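
As a rough illustration of the idea behind erasure coding, here is a minimal Python sketch. It uses a single XOR parity block, which is a deliberate simplification: HDFS erasure coding is based on Reed-Solomon codes, which tolerate more than one lost block. The function names and the sample data are made up for this example and are not taken from the Hortonworks post.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Compute one parity block for a stripe of data blocks."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])


def storage_overhead(data_units, extra_units):
    """Extra storage as a fraction of the raw data size."""
    return extra_units / data_units


# Hypothetical stripe of three equal-length blocks.
data = [b"hdfs", b"bloc", b"ks!!"]
parity = encode(data)

# Pretend the middle block was lost and rebuild it.
rebuilt = recover([data[0], data[2]], parity)

# Three-way replication keeps 2 extra copies per block; a 6-data/3-parity
# layout keeps 3 parity blocks per 6 data blocks.
replication_overhead = storage_overhead(1, 2)
rs_6_3_overhead = storage_overhead(6, 3)
```

The overhead figures show why the approach is attractive for storage: under these assumptions, three-way replication costs 200% extra space while a 6+3 layout costs 50%. The price is that a lost block has to be rebuilt from the rest of the stripe, and small files may not fill a stripe.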

The team at Sky Gaming and Betting has written about how they use the Confluent Schema Registry with Apache Avro and Apache Kafka to enable decentralized implementations across squads within the organization. They are using Node.js, so there's also an overview of the state of the schema registry for a Node.js client.

The Apache Software Foundation blog has a post on the new Column Mapping and Immutable Data Encoding features of Apache Phoenix 4.10 (more below). In short, the column mapping switches Phoenix to use integers rather than strings for column names, which has a number of advantages (including both significant speedups and space savings of around 40% on a TPC-H benchmark).
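
To get a feel for why shorter column identifiers save space, here is a small Python sketch. It is not Phoenix's actual encoding: the 2-byte qualifier width, the helper names, and the row count are assumptions chosen for illustration, and the numbers it produces are not meant to reproduce the benchmark figure above. The column names are borrowed from the TPC-H lineitem table.

```python
def build_column_mapping(column_names):
    """Assign each column name a small integer, in declaration order."""
    return {name: number for number, name in enumerate(column_names)}


def qualifier_bytes_by_name(column_names, rows):
    """Bytes spent on column identifiers when every cell stores the full name."""
    per_row = sum(len(name.encode("utf-8")) for name in column_names)
    return per_row * rows


def qualifier_bytes_by_number(mapping, rows, width=2):
    """Bytes spent on identifiers when every cell stores a fixed-width integer.

    The 2-byte default width is an assumption for this sketch.
    """
    return len(mapping) * width * rows


columns = ["l_orderkey", "l_quantity", "l_extendedprice", "l_shipdate"]
mapping = build_column_mapping(columns)

by_name = qualifier_bytes_by_name(columns, rows=1000)
by_number = qualifier_bytes_by_number(mapping, rows=1000)
```

Since the identifier is repeated in every cell, the saving grows with the row count and with the length of the column names. Actual savings depend on how large the values are relative to the identifiers.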

Amazon has posted performance tips for Amazon Athena (since Athena uses Presto, many of the tips are applicable outside of Athena, too). There are five tips for storing data (covering partitioning and file formats) and five tips on querying data (e.g., avoiding ORDER BY without LIMIT and projecting columns early).
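
The partitioning tip can be pictured with a short Python sketch. It models a Hive-style key=value directory layout and shows how a filter on the partition column reduces the bytes a query has to read. The object keys, sizes, and helper names are hypothetical, and this only imitates the pruning idea; it is not how Athena or Presto is implemented.

```python
# Hypothetical object listing: (key, size in bytes), partitioned by date.
OBJECTS = [
    ("logs/dt=2017-03-24/part-0.parquet", 400),
    ("logs/dt=2017-03-25/part-0.parquet", 500),
    ("logs/dt=2017-03-26/part-0.parquet", 600),
]


def partition_value(key, column):
    """Return the value of a column=value path segment, or None if absent."""
    prefix = column + "="
    for segment in key.split("/"):
        if segment.startswith(prefix):
            return segment[len(prefix):]
    return None


def bytes_scanned(objects, column=None, wanted=None):
    """Sum the sizes of objects a query must read.

    With no partition filter every object is read; with one, objects in
    other partitions are skipped without being opened.
    """
    total = 0
    for key, size in objects:
        if column is not None and partition_value(key, column) != wanted:
            continue
        total += size
    return total


full_scan = bytes_scanned(OBJECTS)
pruned_scan = bytes_scanned(OBJECTS, "dt", "2017-03-26")
```

In this toy listing the filtered query reads 600 of 1500 bytes. The same reasoning applies to projecting only the needed columns from a columnar file format: less data read generally means a faster and cheaper query.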

News

Since hearing that the Strata + Hadoop World conference is being renamed Strata Data Conference, I've been curious to hear more about what the feeling was there. Datanami has some detail with a look at the "shift to real-time," the challenges due to the complexity of Hadoop, and the (perceived?) momentum due to all the companies built around Hadoop.

The DBMS2 blog has a great look at the recently announced Cloudera Data Science Workbench. It adds some new details, like the fact that it's Docker-based to allow teams to install whatever software they need and that it's been beta tested by a number of big companies.

Version 0.12.0 of Apache Knox was released. There are a number of improvements and new features in the release, including improved proxy support, a YARN HA implementation of the REST API and UI, and pluggable pre-auth header provider support.

Apache Kudu 1.3.0 was released with a bunch of new features: Kerberos authentication, encryption in transit using TLS, coarse-grained authorization, background tasks to clean up old data, and a new crash reporter. There are also several optimizations (such as a switch to LZ4 compression) as part of the release.

Apache Gora, which provides an in-memory data model for several different big data frameworks (including Avro, HBase, MongoDB, Spark, and more), has released version 0.7. The release includes over 80 issue resolutions.

TimescaleDB is a new, open-source time series database that's built on the Postgres engine. It's currently available in a single-node version, and there's an interesting whitepaper describing its design.