Data Eng Weekly

Hadoop Weekly Issue #241

19 November 2017

Lots of new releases this week, including Kafka, Hadoop, Hive, and Phoenix. Also, Azure Databricks and Cassandra compatibility for Azure Cosmos DB are both in preview, and there are great technical articles covering Kafka, StreamSets, Redshift, and the Dist-Keras deep learning framework.

Technical

The Landoop blog has an overview of Lenses, their web- and API-based tool for exploring data in Kafka. Built around the Lenses SQL Engine, it detects data types from streams and supports real-time views, batch queries, functions, and "time traveling." There are tabular, tree, and raw views, and a Jupyter integration via the API.

The Qubole blog has a tutorial on using the Dist-Keras framework for deep learning as part of a Spark ML pipeline. While a small part of the post is Qubole-specific, most of it applies to anyone looking to use Spark for deep learning.

Pivotal writes about how they migrated data between GemFire clusters using Apache Kafka for replication. There are some interesting details, including how they avoided an infinite loop using a technique similar to reverse path forwarding.
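
To make the loop-avoidance idea concrete, here is a minimal sketch in Python. It is not Pivotal's implementation: it assumes each event carries an `origin` tag naming the cluster where the write first happened, and that a cluster only publishes events it originated, which is loosely analogous to how reverse path forwarding decides based on where a packet came from. The `Event` and `Cluster` classes, the `origin` field, and the in-memory `outbox` (standing in for a Kafka topic) are all hypothetical names used for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    key: str
    value: str
    origin: str  # assumed tag: cluster where the write first happened


@dataclass
class Cluster:
    name: str
    data: dict = field(default_factory=dict)
    outbox: deque = field(default_factory=deque)  # stands in for a Kafka topic

    def local_write(self, key, value):
        """A client write made directly against this cluster."""
        self.apply(Event(key, value, origin=self.name))

    def apply(self, event):
        self.data[event.key] = event.value
        # Publish only events that originated here. Replicated events are
        # applied locally but never re-published, which breaks the loop.
        if event.origin == self.name:
            self.outbox.append(event)


def replicate(a, b, max_rounds=10):
    """Ship queued events in both directions; return how many were shipped."""
    shipped = 0
    for _ in range(max_rounds):
        if not a.outbox and not b.outbox:
            break
        for src, dst in ((a, b), (b, a)):
            while src.outbox:
                dst.apply(src.outbox.popleft())
                shipped += 1
    return shipped
```

Without the origin check in `apply`, every replicated event would be queued again on the receiving side and bounce between the two clusters indefinitely.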

A secure design for data access requires making tradeoffs. This post describes how to secure Amazon Redshift across multiple accounts, a setup with some apparent inconveniences that can in fact be automated. The walk-through describes loading data via Apache Spark on Amazon EMR and moving data across accounts (by assuming roles with AWS STS).
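
As a rough illustration of the role-assumption step, the sketch below exchanges a role ARN for temporary credentials. It is not taken from the post: the helper name, the default session name, and the role ARN in the usage comment are hypothetical. The only AWS call used is the STS `assume_role` operation as exposed by a boto3 STS client, which is passed in so the helper can be exercised without real credentials.

```python
def assume_role_credentials(sts_client, role_arn, session_name="cross-account-load"):
    """Return temporary credentials for role_arn as boto3-style keyword arguments.

    sts_client is expected to behave like boto3.client("sts").
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# Hypothetical usage (requires boto3 and valid AWS credentials):
#   import boto3
#   kwargs = assume_role_credentials(
#       boto3.client("sts"),
#       "arn:aws:iam::123456789012:role/example-loader",  # placeholder ARN
#   )
#   s3 = boto3.client("s3", **kwargs)  # client acting in the other account
```

The temporary credentials expire, so a long-running job would need to call the helper again before the session ends.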

Confluent has released version 3.3.1 of their Confluent Platform distribution. It includes a number of improvements to both the enterprise and open source versions (the open source version is now based on Apache Kafka 0.11.0.1 and librdkafka 0.11.1).