Post navigation

zuFlow– Query Workflow and Scheduling for Google BigQuery

In 2014, we made a decision to build our core data platform on Google Cloud Platform and one of the products which was critical for the decision was Google BigQuery. The scale at which it enabled us to perform analysis we knew would be critical in long run for our business. Today we have more than 200 unique users performing analysis on a monthly basis.

Once we started using Google BiqQuery at scale we soon realized our analysts needed better tooling around it. The key requests we started getting were

Ability to schedule jobs: Analysts needed to have ability to run queries at regular intervals to generate data and metrics.

Define workflow of queries: Basically analysts wanted to run multiple queries in a sequence and share data across them through temp tables.

Simplified data sharing: Finally it became clear teams needed to share this data generated with other systems. For example download it to leverage in R programs or send it to another system to process through Kafka.

zuFlow Overview

zuFlow is zulily’s a query workflow and scheduling solution for Google BigQuery. There are few key concepts

Job: Job is a executable entity that encompasses multiple queries with a schedule.

Query: SQL statement that can be executed on Google BigQuery

Keyword: Variable defined to be used in the queries

Looper: Ability to define loops like foreach statements.

High Level Design

zuFlow is a web application that enables users to setup jobs and run them either on demand or based on a schedule.

We have integrated the system with off the shelf open source scheduler called SOS. We do plan to migrate this to Airflow in future.

Flowrunner is the brain of the system written in python. It leverages data from config db and executes the queries and stores back the runtime details in the db. Few key capabilities it provides are

Concurrency: We have to manage our concurrency to make sure we are not overloading the system

Retry: In few scenarios based on error codes we retry the queries

Cleanup: It is responsible for cleaning up after the jobs are run including historical data cleanup

zuFlow Experience

Job Viewer: Once logged-in you can see your jobs or you can view all jobs in the system

Creating Job: You can provide it a name, schedule to run and email address of the owner.

Keywords/variables: You can create keywords which you can reuse in your query. This enables analysts to define a parameter and use it in there queries instead of hardcoding values. We also have predefined system keywords for date time handling and making it easier for users to shard tables. Examples:

DateTime: CURRENT_TIMESTAMP_PT, CURRENT_DATE_PT, CURRENT_MONTH_PT, CURRENT_TIME_PT, LAST_RUN_TIMESTAMP_PT, LAST_RUN_TIMESTAMP, LAST_RUN_DATE_PT
BQ Format Pacific date of the last run of this job (will be CURRENT_DATE_PT on first run)

Looping: Very soon after our first release we got requests to add loops. This enables users to define variable and loop through the values.

Query Definition: Now you are ready to write a Google BigQuery query and define where the output will be stored. There are 4 options

BQ Table: In this case you provide BQ table and decide if you want to replace it or append to it. You can also define the output table as temp table and system will clean it up after execution of job is completed.

CSV: If you pick CSV you need to provide GCS location for output

Cloud SQL(MySQL): You can also export to the Cloud SQL.

Kafka: You can provide Kakfa topic name to publish results as messages.

You can define multiple queries and share data across them through temp tables in BQ.

Job Overview: This shows the full definition of the job.

We have thought about open sourcing the solution. Please let us know if you are interested in this system

Share:

Like this:

About Sudhir Hasbe

Sudhir is an accomplished product management leader with over 16 years of experience building industry-leading products at startup and blue-chip companies. He has a proven record of leading product teams across engineering and marketing to deliver business results through customer insights and innovation.