Data Eng Weekly

Data Eng Weekly Issue #260

15 April 2018

A couple of great debugging stories, tips for testing a distributed system in the face of failure, details on the Presto cost-based optimizer, and more great technical posts. There are also several good posts on the future of the big data industry and research communities.

Sponsor

Are you looking to elevate your data culture http://bit.ly/culture-at-netflix? Do you want to work with stunning colleagues, solve complex data problems and contribute to the best video viewing experience in the industry? Come press ‘play’ with Netflix. We’re hiring for many types of data roles. Look for us at the upcoming DataEngConf and say hello to learn more.

Technical

TiDB, which is a new distributed database with MySQL compatibility, has written about how they test their system by injecting faults. They describe a number of fault injection tools, and their Kubernetes-based platform for automatically running these tests.

The Uber engineering team has written a post describing their work over the past years to scale HDFS for large data volumes and number of queries. A lot of the scalability challenges are with the NameNode, and they have solutions like physically partitioning clusters (but presenting a unified view via ViewFS) based on usage characteristics, reducing the number of small files, tuning the NameNode garbage collector, and a new read-replica NameNode (called an Observer NameNode).

Using the TPC benchmark, this first post compares Presto with the new cost-based optimizer vs. without it (there are some impressive speedups—half fo queries got a 2-5x speedup). The second post describes how the optimizer works and why this results in significant speedups. It's a good technical article, going into join re-ordering, projections, and optimization considerations.

This post has a great overview of monitoring Apache Kafka consumer lag—the use cases and SLOs at ZipRecruiter, how to use the Kafka CLI tools along with a simple parsing script to get this information, and several example graphs that capture interesting consumer lag events.

Channable has written about debugging a performance issue with a long-running Apache Spark application. The post goes into lots of interesting details on JVM internals (e.g. custom class loaders and garbage collection of classes), Spark internals (e.g. how the driver cleans up broadcast data), and their metrics and monitoring strategy that enabled hunting down these bugs and verifying their fixes.

This is a fun pair of articles about executing C functions as UDFs in BigQuery by compiling the code to web assembly/javascript. Performance starts out pretty slow, but batching the data before UDF call results in some massive speedups. Maybe not the most practical, but a clever solution nonetheless.

While this article is aimed mostly at the DB research community, it has a great overview of accomplishments from the past 10 years (including the rise of the Hadoop ecosystem and research on distributed databases) as well as upcoming research areas (like learned indexes) that might make their way into production systems sooner than later.

The first of these two posts surveys the options for working with immutable event data in Apache Kafka in the face of the GDPR. Of these options, it proposes a solution for implementing the encryption-based solution. The second article has a number of details on implementing this method, which is built using Kafka streams and a Kafka topic containing encryption keys.

Replacing Amazon Redshift with Amazon Athena can provide massive cost savings if performance is still good enough. In this post, the author describes several improvements—including improved file formats and batching queries—to further reduce cost in a use case with relatively small data sets.

This post recaps the Snowflake "Unite the Data Nation" conference. There are some good observations on changes to data architectures (separating compute from storage and IoT adoption) as well as a list of several Snowflake features.

Cloudera held its industry analysts event last week, and this post highlights five questions that are facing Cloudera (and likely several other vendors in the same space). Everything from the the volatility of the stock market to questions about the future of the company (and in some ways, the broader big data ecosystem).

Sponsor

Are you looking to elevate your data culture http://bit.ly/culture-at-netflix? Do you want to work with stunning colleagues, solve complex data problems and contribute to the best video viewing experience in the industry? Come press ‘play’ with Netflix. We’re hiring for many types of data roles. Look for us at the upcoming DataEngConf and say hello to learn more.

MapR has released an update to support Apache Drill 1.13. Among the new features is support for SQL on MapR streams. The post also highlights some of the features of the release, including several performance enhancements.