Data Eng Weekly

Hadoop Weekly Issue #188

25 September 2016

Lots of releases this week: CouchDB, Accumulo, Kylin, and Osso (a new OSS project from Rocana), but most notably Apache Kudu hit version 1.0. There's a bit less technical content and general news than usual, but that's to be expected. With Strata + Hadoop World taking place this week in NYC, get ready for tons of news in the next issue.

Technical

The Cloudera blog has a post on the recently released Apache Hadoop 3.0.0-alpha1. It describes several of the features of the release, including HDFS erasure coding, v.2 of the YARN Timeline Service, and the shell script rewrite.
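
For readers new to erasure coding, here is a minimal sketch of the underlying idea: store parity alongside the data so a lost block can be rebuilt without keeping full replicas. It is not HDFS code. HDFS erasure coding typically relies on Reed-Solomon codes, and this example swaps in much simpler single-parity XOR, which can recover only one lost block. The function names and sample blocks are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Compute a single parity block for the given data blocks."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks, parity):
    """Rebuild the one missing data block from the survivors and the parity."""
    return xor_blocks(surviving_blocks + [parity])


# Hypothetical sample data: three 4-byte blocks plus one parity block.
data = [b"hado", b"op30", b"test"]
parity = encode(data)

# Pretend the second block was lost, then rebuild it.
rebuilt = recover([data[0], data[2]], parity)
assert rebuilt == data[1]
```

In this toy layout, three data blocks plus one parity block cost about 1.33x the raw size, compared with 3x for triple replication. That storage saving is the usual motivation for erasure coding.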

This post is a great walkthrough of Apache Drill. It covers a bunch of topics, including: quoting reserved keywords, interpreting and fixing JSON parse errors, use of subqueries, conveniences for querying CSV, a basic overview of Drill's web interface, plugin configuration, querying an RDBMS, and analyzing a query plan.

Cloudera has published a post comparing Apache Impala and Amazon Redshift. There's an overview of key differences, but the main focus is a performance and cost comparison. As always, these results shouldn't be viewed as necessarily representative (each dataset is different). With that said, using a TPC-DS-derived workload, they show that Impala can often beat Redshift on both cost and performance.

This post describes some of the challenges of moving a data science research project into a production data pipeline. The author argues that it's important for developers and data scientists to work together to integrate quickly.

Omid is a transaction manager for Apache HBase that was recently accepted into the Apache Incubator after a proposal from Yahoo. It provides snapshot isolation guarantees and is built for high-performance environments (supporting over 100k transactions per second).
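
As a refresher on what snapshot isolation means, here is a small in-memory sketch. It is not Omid's API or design. The `Store`, `Txn`, and `ConflictError` names are hypothetical, and the single counter used as a clock is an assumption made to keep the example short. Each transaction reads from the snapshot taken when it began, and a commit is rejected if another transaction has committed a write to the same key since then.

```python
class ConflictError(Exception):
    """Raised when a write-write conflict is detected at commit time."""


class Store:
    def __init__(self):
        self.clock = 0       # logical timestamp of the latest commit
        self.versions = {}   # key -> list of (commit_ts, value), oldest first

    def begin(self):
        # The snapshot is identified by the clock value at start time.
        return Txn(self, self.clock)


class Txn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}     # buffered writes, invisible to others until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Return the newest version committed at or before our snapshot.
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts <= self.start_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: abort if any written key changed after we began.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                raise ConflictError(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))


store = Store()
setup = store.begin()
setup.write("x", 1)
setup.commit()

t1 = store.begin()
t2 = store.begin()
t1.write("x", 2)
t1.commit()

seen = t2.read("x")      # t2 still sees the value from its own snapshot
t2.write("x", 3)
try:
    t2.commit()
    conflicted = False
except ConflictError:
    conflicted = True
```

Here `t2` keeps reading the old value of `x` even after `t1` commits, and its own commit is rejected because `t1` wrote the same key first.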

The Google Cloud Platform blog has highlighted three integrations related to Kafka. The Google Cloud Pub/Sub connectors offer a mechanism for moving data between Pub/Sub and Kafka, the KafkaIO connector for Apache Beam allows Beam pipelines to consume from Kafka, and the Kafka to BigQuery connector can be used to mirror data to BigQuery.

Apache Kudu announced version 1.0 this week. The release includes support for HA Kudu Master, a rewritten Apache Spark integration, an official client library for Python, and more. To mark the occasion, the Cloudera blog has an overview of the history of the project and a look at its future.