Data Eng Weekly

Hadoop Weekly Issue #245

24 December 2017

I debated skipping this week's issue, but it turns out there were a lot of great articles to share. As is normal for this time of year, there are a couple of year-in-review posts among them. There are also quite a few great technical posts on Spark, Kafka, LinkedIn's Venice, the YARN capacity scheduler, and more. In releases, Pulsar, HBase, Ampool, and KSQL all unveiled new versions.

Technical

CockroachDB uses multiversion concurrency control (MVCC) for concurrent access to data, and a transaction queue provides additional support for concurrent transactions. This post demonstrates these concepts with an easy-to-follow set of diagrams.
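To make the core MVCC idea concrete, here is a minimal sketch in Python. It is not CockroachDB's implementation, and it does not model the transaction queue; the class and method names (MVCCStore, write, read) are hypothetical and chosen only for illustration. Each write keeps a new timestamped version, and a reader sees the newest version at or before its own timestamp.

```python
from bisect import bisect_right
from collections import defaultdict


class MVCCStore:
    """Toy multiversion store: every write adds a new timestamped version."""

    def __init__(self):
        # key -> parallel lists of timestamps (ascending) and values
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, key, value, ts):
        # Simplifying assumption: writes to a key arrive in increasing
        # timestamp order, so versions stay sorted.
        if self._timestamps[key] and ts <= self._timestamps[key][-1]:
            raise ValueError("write timestamp must be newer than existing versions")
        self._timestamps[key].append(ts)
        self._values[key].append(value)

    def read(self, key, ts):
        # Return the newest version with timestamp <= ts, or None if the
        # key had no version yet at that time.
        idx = bisect_right(self._timestamps.get(key, []), ts)
        return self._values[key][idx - 1] if idx else None
```

Because old versions are kept, a reader at an earlier timestamp is not affected by a later write: after writing 100 at timestamp 1 and 80 at timestamp 5, a read at timestamp 3 still returns 100.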

Impala can now take advantage of column statistics when scanning data stored in Parquet files. This post describes how it uses the min and max values, as well as information stored in dictionaries, to skip entire blocks of data during a query. There are a few considerations when loading your data, which the post also describes.
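The min/max part of this technique can be sketched in a few lines of Python. This is an illustration of the general idea only, not Impala's code: RowGroup and scan_equal are hypothetical names, and dictionary-based filtering is left out. A block is read only if the value being searched for could fall inside its recorded range.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical block of column values with precomputed statistics."""
    min_value: int
    max_value: int
    rows: list


def scan_equal(row_groups, target):
    """Return (matching values, number of row groups skipped)."""
    matches, skipped = [], 0
    for group in row_groups:
        # The statistics prove the target cannot be in this block,
        # so its rows are never examined.
        if target < group.min_value or target > group.max_value:
            skipped += 1
            continue
        matches.extend(v for v in group.rows if v == target)
    return matches, skipped
```

In this sketch, skipping only helps when the ranges of different blocks overlap little; if every block spans the full range of values, nothing can be skipped.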

Venice is a newer system to replace Voldemort for serving key-value data at LinkedIn. It ingests data in batch via Kafka, which is the focus of this post. In addition, Venice supports importing real-time data to implement the lambda architecture. The post describes some of the considerations for that, too.

This post describes the role that a streaming system, like Apache Kafka, can play in a microservices architecture. It argues that leveraging a streaming system can resolve some of the problems caused by the large amounts of data and the interconnectivity that arise in a microservices architecture.

One of the components of the NATS project is a distributed log similar to Kafka. This post, which is the first in the series, looks at the requirements and tradeoffs to consider in the data storage component of a Kafka-like system. NATS is open-source and written in Go.
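For readers new to the abstraction, the following Python sketch shows the basic contract of such a log: records are appended in order, each gets a monotonically increasing offset, and consumers read forward from an offset of their choosing. It is an in-memory toy with hypothetical names (AppendOnlyLog, append, read), not the NATS or Kafka storage design, and it ignores the durability, segmentation, and replication questions that a real system has to answer.

```python
class AppendOnlyLog:
    """Toy in-memory log: ordered records addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Append a record and return the offset assigned to it."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        # Reading never removes data, so many consumers can each
        # track their own position independently.
        return self._records[offset:offset + max_records]
```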

The Google Cloud Platform has a post highlighting a number of lesser-known facts about BigQuery. These include its support for user-defined functions, several of the enterprise features for identity and access management, cell-level access control, and audit logging.

Releases

Version 1.21.0-incubating of Apache Pulsar was released. Key changes include enhancements to the Kafka API wrapper, an upgrade to Netty, better scalability for large numbers of topics, and secure replication via TLS.

Version 0.3, the December release, of KSQL was announced. It includes several major features: Avro support, integration with the Confluent Schema Registry, the ability to convert between data formats, the ability to join across different formats, and support for basic metrics.

Apache HBase 1.4.0 was released. It resolves over 660 issues. Major features include a new shaded client that should improve compatibility, improvements to the REST client, enhanced autorestart capabilities, and improvements to RegionServer metrics.

Version 2.0 of Ampool was released. It includes security enhancements, column statistics, new file formats, and more. Also, a few weeks back, the Ampool Active Data Store was released to the AWS Marketplace.