Data Eng Weekly

Data Eng Weekly Issue #270

24 June 2018

Lots of variety in this week's issue: topics include Data Reliability Engineering at Criteo, the history of the Apache Arrow project, stream processing with both Wallaroo and St8Flow (a JavaScript framework), a Parquet backend for SQLite, and a few posts on working with large-scale relational databases.

Sponsor

SimpleDataLabs builds Prophecy, a predictive analytics designer for business analysts, powered by our DeepWisdom engine. It'll put predictive analytics in every business. We're looking for two Founding Engineers: a System Architect to drive our SaaS application (React/Scala/Spark/Kubernetes/cloud) and an ML Architect to build meta-learning in TensorFlow.

Technical

Heap has built a product that captures and analyzes large volumes of data about how users interact with a website. This post describing their software architecture offers some great advice on technical decision making and recommends specific technologies (e.g., adopting Kafka early).

This tutorial walks through running the data Artisans Platform on Google Kubernetes Engine. In addition to the basics of running an application on Kubernetes, it covers using Google Cloud Storage for checkpoints and building and using a custom Docker image.
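As a hedged sketch of the checkpointing piece (the bucket name and paths here are hypothetical, not taken from the tutorial), pointing Flink's checkpoints at Google Cloud Storage largely comes down to a few lines of flink-conf.yaml, plus having the GCS Hadoop connector on the classpath:

```yaml
# flink-conf.yaml (sketch; bucket name is hypothetical)
state.backend: rocksdb
state.checkpoints.dir: gs://my-flink-bucket/checkpoints
state.savepoints.dir: gs://my-flink-bucket/savepoints
```

The gs:// scheme is resolved through the Hadoop filesystem layer, which is why the custom Docker image in the tutorial matters: it bakes in the connector jar and its configuration.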

This post walks through a solution to a common problem—getting data from an external service (e.g. Google Analytics) into your data platform. It uses the StreamSets Data Collector with the HTTP Client origin.

GitHub has migrated from a MySQL high-availability strategy based on DNS and virtual IPs to one built on Raft, Consul, and HAProxy. They use orchestrator (a system they built internally) for failure detection and for initiating MySQL failover. With this solution, failovers for their multi-datacenter MySQL deployment incur under 30 seconds of downtime.
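The routing layer can be pictured with a minimal HAProxy sketch (the addresses and names below are hypothetical, not GitHub's actual configuration): clients always connect to a local proxy, and a template-driven tool regenerates the backend from Consul's record of the current master whenever orchestrator promotes a new one:

```
# haproxy.cfg sketch -- the server line below is rewritten from
# Consul data (e.g. via consul-template) after each failover,
# so clients never need to learn the new master's address
listen mysql-master
    bind 127.0.0.1:3306
    mode tcp
    server current-master 10.0.0.12:3306 check
```

The appeal of this design over DNS/VIPs is that the reconfiguration is local and fast: no TTLs to wait out and no ARP caches to invalidate.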

The Dremio blog has a look at the architecture behind the Gandiva initiative, which aims to bring speedups to Apache Arrow through LLVM code generation. The post discusses optimizations like vectorization and pipelining. Early work is showing some impressive speedups over the JVM JIT.

PgBouncer fronts PostgreSQL to handle thousands of client connections with far fewer resources than PostgreSQL's process-per-connection model allows. This post, the second in a series, describes how to use PgBouncer in a multi-tenant environment in which multiple types of services, each with different service-level objectives, connect to the database.
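A minimal pgbouncer.ini sketch shows the multi-tenant idea (database names, hosts, and pool sizes here are hypothetical): each class of service gets its own pool sized to its service-level objective, while all of them multiplex onto a much smaller number of server connections:

```ini
[databases]
; per-tenant entries: same physical database, separate pools
web_app   = host=10.0.0.5 dbname=app pool_size=40
batch_etl = host=10.0.0.5 dbname=app pool_size=5

[pgbouncer]
listen_addr     = 0.0.0.0
listen_port     = 6432
auth_type       = md5
auth_file       = /etc/pgbouncer/userlist.txt
pool_mode       = transaction
max_client_conn = 5000
```

With pool_mode = transaction, a server connection is returned to the pool at each transaction boundary, which is what lets thousands of mostly-idle clients share a few dozen backends.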

Schibsted has a multi-tenant Presto platform for querying data in S3. This post describes a neat solution to authorization built atop AWS IAM, their use of AWS Glue as a metastore, how they monitor with Datadog, and their CI and deployment infrastructure built on Docker, Travis, Spinnaker, and FPM.

Historically, there haven't been great tools for ad hoc queries of data stored in Avro, ORC, and Parquet files. A new option is a Parquet backend for SQLite, which is both quite helpful for ad hoc introspection and highly performant. It opens up the possibility of an online service consuming Parquet files (via SQLite) to power API endpoints. This post introduces the backend and compares its performance to other data formats in SQLite and PostgreSQL.
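As a hedged sketch (the extension path, file path, and column names below are hypothetical), querying a Parquet file from the sqlite3 shell with a virtual-table backend of this kind looks roughly like:

```sql
-- load the parquet virtual-table extension, then expose a file as a table
.load ./libparquet
CREATE VIRTUAL TABLE flights USING parquet('/data/flights.parquet');
SELECT carrier, COUNT(*) AS n FROM flights GROUP BY carrier ORDER BY n DESC;
```

Because the file is exposed as an ordinary SQLite table, anything that speaks SQLite (including embedded use inside a service) can query it without an import step.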

News

Datanami has coverage of some of the big announcements from Hortonworks and its partners at this past week's DataWorks Summit. Among them are new cloud offerings (including Hortonworks DataFlow on AWS and Microsoft Azure) and a preview of HDP based on Apache Hadoop 3.

The Criteo Labs blog has a great post describing the history of their big data systems and data team, including scaling problems and the principles they've embraced to solve technical challenges. It also introduces the notion of a Data Reliability Engineer, a hybrid of data engineer and SRE. At Criteo, the team responsible for data tools and for keeping systems functioning at scale falls under the SRE organization.

Dremio has the story of Apache Arrow, which has quickly become an important component of data infrastructure. In addition to its history (including the original team that conceived of it), they cover recent developments such as GPU support and the Arrow Flight protocol (which aims to replace ODBC/JDBC for in-memory analytics).