Data Eng Weekly

Data Eng Weekly Issue #276

05 August 2018

Lots of stream processing this week, and a couple of posts on moving data out of Kafka for batch processing. There's also some breadth to the coverage—Apache Atlas, Apache Hadoop 3.1, and several releases.

Sponsor

Built by narwhals, Dremio is the Data-as-a-Service platform that simplifies data engineering and data analytics. Accelerate your queries up to 1,000x. Provide your BI and data science users a self-service experience. Open source.

The Hive metastore supports tables whose partitions have differing serialization formats, which can be quite common as data is converted from a write-friendly format to a read-optimized one. This post walks through Spark internals and execution plans to determine how to get multi-format tables working in Spark.
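The crux of the mixed-format situation can be sketched in plain Python: the metastore records a serialization format per partition, so a table scan must pick the right deserializer partition by partition rather than once per table. This is a conceptual sketch with hypothetical names, not Spark's actual internals:

```python
# Sketch of per-partition format dispatch, assuming a metastore that
# records one serialization format for each partition (hypothetical
# names; Spark's real execution plan works differently).
import json
import csv
import io

# Each partition carries its own format, e.g. after older JSON
# partitions are gradually rewritten in a read-optimized format
# (CSV stands in here for something like Parquet).
partitions = {
    "dt=2018-08-01": ("json", '{"id": 1}\n{"id": 2}\n'),
    "dt=2018-08-02": ("csv", "id\n3\n4\n"),
}

def read_partition(fmt, raw):
    """Choose a deserializer based on the partition's declared format."""
    if fmt == "json":
        return [json.loads(line) for line in raw.splitlines()]
    if fmt == "csv":
        return [{"id": int(row["id"])} for row in csv.DictReader(io.StringIO(raw))]
    raise ValueError("unsupported format: %s" % fmt)

def read_table():
    """A full table scan is the union of per-partition reads."""
    rows = []
    for fmt, raw in partitions.values():
        rows.extend(read_partition(fmt, raw))
    return rows
```

The trap the post explores is the opposite assumption: resolving the format once at the table level, which silently misreads partitions whose format has since changed.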

An interesting look at bol.com's data ingestion journey in building a system to mirror data from Kafka to HDFS. They share some good lessons learned from implementing a Flink-based pipeline, and they describe the ultimate solution.

One of the main new features in Apache Pulsar 2.1 (more about that release below) is tiered storage. With it, data can be automatically or manually offloaded to S3 or other blob storage. This post describes the new feature in more detail.


GO-JEK has written about their data pipeline for taking data from Kafka and putting it in cold storage. They've designed an at-least-once delivery system that supports loading new Protobuf definitions without a service restart.
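At-least-once delivery of this kind boils down to committing the consumer offset only after the write to cold storage succeeds, accepting duplicates on retry. A minimal sketch with stand-in functions, not GO-JEK's actual pipeline:

```python
# Minimal at-least-once consume loop: persist first, commit second.
# All names here are stand-ins, not GO-JEK's pipeline.
def consume_at_least_once(messages, sink, commit):
    """Write each message to the sink, then commit its offset.

    If the process dies between sink() and commit(), the message is
    re-delivered on restart -- possible duplicates, but no data loss.
    """
    for offset, msg in messages:
        sink(msg)        # durable write to cold storage
        commit(offset)   # only now is the message considered done

stored, committed = [], []
consume_at_least_once(
    [(0, "a"), (1, "b")],
    sink=stored.append,
    commit=committed.append,
)
```

The downstream batch jobs then have to tolerate the occasional duplicate, typically via idempotent writes or dedup on a message key.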

This post is a great resource for understanding how Kafka brokers and clients discover the cluster topology, especially when using private networks and/or IPs. It covers the broker configuration settings, describes several common challenges, and provides examples for Kafka running in Docker and in AWS.
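The central settings are the broker's `listeners` and `advertised.listeners`; the latter is what the broker hands back to clients after the bootstrap connection, so it must be reachable from the client's network. A typical shape (illustrative hostnames and ports only):

```properties
# Addresses the broker binds to locally
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093
# Addresses the broker tells clients to use -- these must be
# resolvable and reachable from the client side (e.g. a public
# DNS name when clients connect from outside an AWS VPC)
advertised.listeners=INTERNAL://broker-1.internal:9092,EXTERNAL://broker-1.example.com:9093
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL
```

The classic failure mode the post describes follows from this handshake: bootstrap succeeds, but subsequent connections fail because the advertised address is only valid inside the private network.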

Just over a month ago, Dremio (disclosure: Dremio sponsors Data Eng Weekly) announced the Gandiva initiative to bring LLVM-based code generation speedups to Apache Arrow. The Gandiva project is now live, and its README has a great overview of the basics of the code generation and the optimizations it implements so far.

The data Artisans blog has an overview of Broadcast State in Apache Flink. It's useful for efficiently joining a high-volume stream with a low-volume one, as illustrated by the example in their post.
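The pattern resembles a broadcast join: the small control stream is replicated to every parallel task and kept as state, and each high-volume element is matched against that state. Conceptually, in plain Python (not Flink's actual BroadcastProcessFunction API):

```python
# Conceptual broadcast-state join: a small "rules" stream updates
# shared state; a high-volume "events" stream reads from it.
# Plain Python, not Flink's BroadcastProcessFunction API.
class BroadcastJoin:
    def __init__(self):
        self.rules = {}  # broadcast state: replicated to every task

    def on_rule(self, key, value):
        """Low-volume side: update the broadcast state."""
        self.rules[key] = value

    def on_event(self, event):
        """High-volume side: enrich each event from the state."""
        enriched = dict(event)
        enriched["rule"] = self.rules.get(event["key"])
        return enriched

join = BroadcastJoin()
join.on_rule("eur", 1.1)
enriched = join.on_event({"key": "eur", "amount": 5})
```

In Flink the broadcast side is replicated to all parallel instances, which is why it only pays off when that side stays small.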

This post has a good overview of the talks at the recent Apache Hadoop meetup in Bangalore, which covered Ozone (the blob store API for HDFS), Myntra's data ingestion platform, scaling Hadoop at LinkedIn, Presto performance optimizations, and migrating to Hadoop 3.

Two moderate-severity CVEs in Apache Kafka authentication were disclosed: the first allows clients to impersonate other users, and the second allows clients to interfere with replication. Both are mitigated by upgrading to the latest release on the affected branches.

Apache Kafka 2.0.0 has been released with a large number of changes spanning 40 KIPs. Major ones include improvements to ACLs, OAuth2 bearer token authentication for Kafka brokers, improvements to the replication protocol, several Kafka client improvements, and more.

Faust is a Python stream processing library modeled on Kafka Streams. It requires Python 3.6, as it's built on async/await. It supports several types of aggregation/windowing and includes local-state support via RocksDB. Version 1.0.25 was just released.
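The tumbling-window aggregation that Faust's tables provide reduces to bucketing events by ⌊timestamp / window⌋. A library-free sketch of the idea (Faust additionally persists this state in RocksDB and handles expiry):

```python
# Library-free sketch of tumbling-window counting, the kind of
# windowed aggregation Faust's tables offer. Real Faust keeps this
# state in RocksDB and manages window expiry for you.
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Count events per key per tumbling window.

    Each event is a (timestamp, key) pair; windows are fixed-size,
    non-overlapping buckets of `window_seconds`.
    """
    counts = defaultdict(int)
    for ts, key in events:
        bucket = int(ts // window_seconds)  # which window this event lands in
        counts[(key, bucket)] += 1
    return dict(counts)

result = tumbling_counts([(1, "a"), (4, "a"), (11, "a"), (12, "b")], 10)
```

Hopping and sliding windows generalize this by letting one event land in multiple buckets.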

Version 2.1.0-incubating of the Apache Pulsar streaming platform is now available. The release has several new features, including tiered storage, stateful functions, a Go client, and schema support for Avro and Protobuf.
