Data Eng Weekly

Data Eng Weekly Issue #252

18 February 2018

Lots this week on stream processing, including coverage of the Pravega streaming storage system, exactly-once in Apache Flink, new features in Hortonworks Data Flow, getting started with Spring Cloud Data Flow, and building an application with Confluent KSQL. Qubole also has a great post on some optimizations they've made to query performance in Presto. In releases, Apache Oozie, Apache Storm, and Apache Flink all have new versions out this week.

Technical

The Pravega project came across my radar for the first time. Open-source and from Dell EMC, it's a distributed system implementing streams with similarities to Kafka and Apache Pulsar. Key differentiators are automatic movement of cold data to HDFS or other tier-two storage and auto-scaling of stream segments. That auto-scaling functionality is one of the topics of the following post, which also looks at the API for writing to and reading from Pravega.

This is a good article on how to pick the right tech stack to quickly stand up a data warehouse. AWS provides the plumbing with Kinesis, Redshift, Glue, Lambda, and more. Lots of good tips if you go down a similar route or are using these (or related) technologies in AWS.

Qubole has implemented two optimizations for Presto—join reordering and dynamic filtering. The post describes how these improvements are implemented and how they improve performance in certain situations. The article also details performance results from an analysis with TPC-DS queries (2.8-14x speedup and several more queries run to completion than before). While these optimizations are available in Qubole Presto, they're also working with the community to get them into the main branch.
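As a rough illustration of the dynamic filtering idea, here is a conceptual sketch (not Qubole's implementation): the engine collects the join keys from the small build side at runtime and uses that set to prune probe-side rows before they ever reach the join.

```python
# Conceptual sketch of dynamic filtering in a hash join: collect the join
# keys from the small (build) side at runtime, then use them to skip
# probe-side rows early. Real engines push the key set (or a bloom filter
# over it) down into the probe-side table scan.

def dynamic_filter_join(build_rows, probe_rows, key):
    build_by_key = {}
    for row in build_rows:                      # small dimension table
        build_by_key.setdefault(row[key], []).append(row)

    keys = set(build_by_key)                    # the "dynamic filter"
    results = []
    for row in probe_rows:                      # large fact table
        if row[key] not in keys:                # pruned early, never joined
            continue
        for match in build_by_key[row[key]]:
            results.append({**match, **row})
    return results

dim = [{"region_id": 1, "region": "us"}, {"region_id": 2, "region": "eu"}]
fact = [{"region_id": 1, "sales": 10}, {"region_id": 3, "sales": 7},
        {"region_id": 2, "sales": 4}]
joined = dynamic_filter_join(dim, fact, "region_id")
print(joined)  # only fact rows with region_id 1 and 2 survive the filter
```

The win comes when the probe side is much larger than the build side: filtered rows skip the join (and, in a distributed engine, the shuffle) entirely.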

If you're doing any SQL database programming from Scala, Doobie looks like a useful library for writing your JDBC queries. It enables writing raw SQL queries but adds a lot of functionality on top, including type-checked conversion of results to case classes and execution of prepared statements. This post has a good overview of how to get started.

Another open-source project that's new to me, ArangoDB, is a NoSQL database that supports multiple data models, including graph, key/value, and document storage. This post describes the results of some recent benchmarking against PostgreSQL, MongoDB, Neo4j, and OrientDB. The post includes good disclaimers (it's always important to benchmark with your own use cases and data), but with that said, the single-node results are impressive.

If you're using Pivotal Cloud Foundry or Spring, Spring Cloud Data Flow might be a great way to get started with stream processing. This post gives a brief tour of how to get set up (installing the various components with the cf tools) and build a simple log parsing application with the Data Flow shell.

Apache Flink's checkpointing has provided exactly-once semantics within a Flink application for some time now. With the 1.4.0 release, they've also added the ability to ensure exactly-once delivery to an Apache Kafka or Pravega data sink. This post details the two-phase commit implementation that powers the exactly-once delivery.
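The handshake is roughly: writes go into an open transaction, a checkpoint triggers a pre-commit (the batch becomes durable but not yet visible), and the final commit happens only after Flink signals that the whole checkpoint completed. Here is a toy, in-memory sketch of that shape; it mirrors the idea behind Flink's TwoPhaseCommitSinkFunction but is a self-contained illustration, not the real API.

```python
# Toy sketch of a two-phase-commit sink. Phase 1 (pre-commit) makes the
# batch durable but invisible; phase 2 (commit) runs only after the
# checkpoint coordinator confirms every operator finished the checkpoint,
# so either all sinks commit or, on failure, all pre-committed data is
# aborted. That all-or-nothing property is what gives exactly-once.

class TwoPhaseCommitSink:
    def __init__(self):
        self.current = []        # open transaction buffer
        self.pending = {}        # checkpoint_id -> pre-committed batch
        self.committed = []      # visible to downstream consumers

    def write(self, record):
        self.current.append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: flush the open transaction; durable but not visible.
        self.pending[checkpoint_id] = self.current
        self.current = []

    def commit(self, checkpoint_id):
        # Phase 2: checkpoint confirmed everywhere; make data visible.
        self.committed.extend(self.pending.pop(checkpoint_id))

    def abort(self, checkpoint_id):
        # On failure, discard pre-committed but unconfirmed data.
        self.pending.pop(checkpoint_id, None)

sink = TwoPhaseCommitSink()
sink.write("a")
sink.write("b")
sink.pre_commit(checkpoint_id=1)   # checkpoint barrier reaches the sink
sink.commit(checkpoint_id=1)       # checkpoint-complete notification
print(sink.committed)              # ['a', 'b']
```

If a failure occurs between pre-commit and commit, a real sink must be able to either re-commit or abort the pending transaction on recovery, which is why the sink system (Kafka transactions, for instance) has to support transactional writes.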

Hortonworks Data Flow (HDF) includes the open-source Streaming Analytics Manager for UI-driven definition of streaming applications. HDF 3.1 added a test mode (with fake data) and the ability to unit test these applications. This post describes how to do both.

Confluent has a post that describes (complete with lots of code examples) consuming change data capture events from an Oracle database, applying a number of transformations and aggregations with KSQL, and storing the resulting data in Elasticsearch for analysis with Kibana. There's lots of great stuff in here, but it does a particularly good job of demonstrating the differences between streams and tables as well as between event and processing time.
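The stream/table distinction the post demonstrates boils down to this: a stream is the append-only sequence of events, while a table is the latest state per key, obtained by folding that stream. A minimal conceptual sketch (plain Python, not KSQL):

```python
# A stream is an append-only log of events; a table is the current state
# per key, i.e. the result of replaying (folding) the stream. KSQL's
# stream/table duality is this idea applied to Kafka topics.

stream = [
    {"user": "alice", "city": "Berlin"},
    {"user": "bob",   "city": "Paris"},
    {"user": "alice", "city": "Munich"},   # later event for the same key
]

def to_table(stream, key):
    table = {}
    for event in stream:           # replaying events in order...
        table[event[key]] = event  # ...keeps only the latest value per key
    return table

print(len(stream))                 # 3: the stream retains every event
print(to_table(stream, "user"))    # 2 keys: alice's latest city is Munich
```

Event time vs. processing time is the other axis the post covers: the fold above is only correct if events arrive in event-time order, which is exactly why streaming engines need timestamps and windowing to handle late data.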

Sponsor

Hey, Data Eng readers: Which big data company’s location tech is embedded in 125K services and apps, from Apple to Uber? The answer—Foursquare. We have a 16 PB cluster that runs 10k jobs a day on Spark, Scalding and Presto. And we’re looking for engineers.

Datanami predicts that 2018 will be the year of the data engineer (I guess I was right to rename the newsletter!). It notes some relevant stats, such as job postings for data engineers outnumbering those for data scientists by 4-5x.

Sponsor

Hello Fresh: Change the way people eat forever. Work with our data technology to deliver healthy meals to millions of customers, with a cutting-edge tech stack (Hadoop, Kafka, Impala, pyspark, AWS, Airflow) and time for personal and engineering development. Click the link for more info on becoming a Data Engineer at Hello Fresh in Berlin!