Data Eng Weekly

Data Eng Weekly Issue #274

22 July 2018

Lots of stream processing coverage this week—Apache Kafka, Wallaroo, Apache Samza, WSO2, and Amazon SQS. There are also a couple of posts on Kubernetes, a presentation with database monitoring best practices, and a look at a distributed configuration system at Facebook. In news, there are two new books that may be of interest and a proposed data ethics checklist.

Sponsor

Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2t96fa7 or visit http://dremio.com to learn more.

Technical

The MapR blog has a tutorial describing how to get started with Apache Drill using several different OSes. It includes an example use case of joining JSON data with tables in MySQL and covers debugging some common problems.
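The kind of cross-source join the tutorial demonstrates can be submitted to Drill over its REST API. A minimal sketch in Python, assuming a local drillbit on the default web port; the storage-plugin names (`dfs`, `mysql`) and table paths are illustrative:

```python
import json

DRILL_URL = "http://localhost:8047/query.json"  # Drill's default web port (assumes a local drillbit)

def build_drill_request(sql):
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return {"queryType": "SQL", "query": sql}

# Join a JSON file with a MySQL table, in the spirit of the tutorial.
sql = """
SELECT o.order_id, c.name
FROM dfs.`/data/orders.json` o
JOIN mysql.shop.customers c ON o.customer_id = c.id
"""

body = build_drill_request(sql)
print(body["queryType"])

# To actually run it (requires a running drillbit):
#   import urllib.request
#   req = urllib.request.Request(DRILL_URL, data=json.dumps(body).encode(),
#                                headers={"Content-Type": "application/json"})
#   rows = json.load(urllib.request.urlopen(req))["rows"]
```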

This presentation describes the main signals (Concurrency, Error Rate, Latency, and Throughput) that are important to measure when monitoring a database system. It describes how they relate to quality of service, several different ways to track these metrics, and common problems across a few different databases.
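As a rough illustration of those four signals, here's a sketch that summarizes them from a window of (latency, success) samples; it's not from the presentation, and concurrency is approximated via Little's law rather than measured directly:

```python
from statistics import quantiles

def health_signals(requests, window_seconds):
    """Summarize the four signals from a list of (latency_ms, ok) samples.
    Concurrency is estimated via Little's law: L = throughput * avg latency."""
    n = len(requests)
    latencies = [ms for ms, _ in requests]
    errors = sum(1 for _, ok in requests if not ok)
    throughput = n / window_seconds                 # requests per second
    avg_latency_s = sum(latencies) / n / 1000.0
    return {
        "throughput_rps": throughput,
        "error_rate": errors / n,
        "p99_latency_ms": quantiles(latencies, n=100)[98],
        "concurrency": throughput * avg_latency_s,  # Little's law
    }

samples = [(12, True)] * 95 + [(250, False)] * 5    # 5% errors with a slow tail
print(health_signals(samples, window_seconds=10))
```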

This post provides a good overview of the types of challenges there are with running a stateful service on Kubernetes. In this case, they've designed a solution for resilience of the Spark application driver in the face of network partitioning. The recovery process is a bit tricky, and there's an interesting discussion on the referenced PR about this and other possible designs to solve for resilience.

Facebook was hitting scalability limits when using ZooKeeper for storing dynamic configuration. This post describes those challenges and how the system they built to replace it, Location-Aware Distribution, solves them.

This tutorial builds a stream processing application in Python using the Wallaroo streaming engine. The application is an e-commerce / marketing system that tracks several types of events and triggers a personalized
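Setting Wallaroo's API aside, the core pattern in such an application (per-key state updated on each event, firing an action when a condition is met) can be sketched in plain Python; the event types and the "send_offer" action here are illustrative, not from the tutorial:

```python
from collections import defaultdict

TRIGGER_EVENTS = {"viewed", "carted", "abandoned"}  # illustrative event types

class SessionState:
    """Per-user state, as a stream processor would partition it by key."""
    def __init__(self):
        self.seen = set()

def process(event, states, actions):
    """Update the user's state; fire an action once all trigger events are seen."""
    state = states[event["user"]]
    state.seen.add(event["type"])
    if TRIGGER_EVENTS <= state.seen:
        actions.append(("send_offer", event["user"]))  # stand-in for a personalized message
        state.seen.clear()

states = defaultdict(SessionState)
actions = []
for ev in [{"user": "u1", "type": "viewed"},
           {"user": "u1", "type": "carted"},
           {"user": "u1", "type": "abandoned"}]:
    process(ev, states, actions)
print(actions)  # [('send_offer', 'u1')]
```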

In this post, MoEngage writes about how they optimized cost and latency in their SQS pipeline by batching data. They used some interesting techniques to pack an optimal number of messages into each batch while keeping latency low.
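The post doesn't publish its batching code, but a generic greedy packer that respects SQS's per-batch limits (10 messages, 256 KB of payload per SendMessageBatch call) looks roughly like this:

```python
MAX_BATCH_MESSAGES = 10          # SQS SendMessageBatch message-count limit
MAX_BATCH_BYTES = 256 * 1024     # SQS total payload limit per batch

def pack_batches(messages):
    """Greedily pack message bodies into SQS-sized batches."""
    batches, current, size = [], [], 0
    for body in messages:
        body_size = len(body.encode("utf-8"))
        if current and (len(current) == MAX_BATCH_MESSAGES
                        or size + body_size > MAX_BATCH_BYTES):
            batches.append(current)
            current, size = [], 0
        current.append(body)
        size += body_size
    if current:
        batches.append(current)
    return batches

# 25 small messages -> 3 batches of 10, 10, and 5
print([len(b) for b in pack_batches(["x"] * 25)])
```

Keeping latency low on a trickle of traffic would additionally need a time-based flush, which this sketch omits.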

New York City subway data is available via a RESTful API as a GTFS Realtime feed. This tutorial builds a system to load that data into Apache Kafka and process arrival data in real time. It's written in Python, and the example code is available on GitHub.
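The GTFS Realtime side of such a pipeline reduces to pulling arrival records out of TripUpdate entities. A sketch, using the spec's field names on a feed already decoded to a dict (the sample trip and stop IDs are made up, and the Kafka producer step is shown only in comments):

```python
def extract_arrivals(feed):
    """Pull (trip_id, stop_id, arrival time) records out of a GTFS Realtime
    feed decoded to a dict, following the TripUpdate/StopTimeUpdate schema."""
    records = []
    for entity in feed.get("entity", []):
        trip_update = entity.get("trip_update")
        if not trip_update:
            continue
        trip_id = trip_update["trip"]["trip_id"]
        for stu in trip_update.get("stop_time_update", []):
            if "arrival" in stu:
                records.append({"trip_id": trip_id,
                                "stop_id": stu["stop_id"],
                                "arrival": stu["arrival"]["time"]})
    return records

feed = {"entity": [{"trip_update": {
    "trip": {"trip_id": "123_N"},
    "stop_time_update": [{"stop_id": "R16N", "arrival": {"time": 1532300000}}]}}]}
print(extract_arrivals(feed))

# Producing each record to Kafka (requires kafka-python and a running broker):
#   import json
#   from kafka import KafkaProducer
#   producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode())
#   for rec in extract_arrivals(feed):
#       producer.send("subway-arrivals", rec)
```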

Apache Beam now has an Apache Samza runner for executing applications. According to the compatibility matrix, it has pretty good support for the Beam Model (on par with Apache Apex and Apache Spark). This presentation provides more details about the implementation.

This post has an introduction to the WSO2 stream processing framework including two example programs. The examples, of loading data from JMS to MySQL and extracting data from a DB and loading into Kafka, demonstrate the Siddhi Application DSL and the graphical view that comes with WSO2.

Sponsor

Unravel demoed a new, fully automated Spark optimization tool at Spark Summit in San Francisco. They showed how to speed up or improve reliability of any Spark application with a single click. See the demo video or download the slides here.

News

BlueData has announced a new initiative, called BlueK8s, to bring stateful distributed systems (such as Hadoop, Spark, and Kafka) to Kubernetes. There's a pre-alpha implementation via an open source project called KubeDirector.

This article introduces a new "checklist for people who are working on data projects." Mostly phrased as questions to consider as part of your project, it covers topics across the data science and data engineering spectrum.

"Streaming Systems" is a new book from O'Reilly. It covers topics like watermarks, exactly-once, streaming joins, and streaming SQL. The book is available for download now and will be available in print in a few weeks.

Releases

Databricks Runtime 4.2 includes new features, improvements, and performance upgrades to Databricks Delta (which is getting closer to general availability). It also includes new features in Structured Streaming and a new SQL Deny command for access control.
