Playing around with Apache Airflow & BigQuery

My Confession

I have a confession… my crontab is a mess and it's keeping me up at night… don't worry, it's not really keeping me up… but you might know what I mean 🙂

We have been using Google BigQuery as our main data mart (lake, or whatever it's now called) for almost two years. We have been loving it, as it's super powerful with very little overhead in terms of management and infrastructure.

However, one thing I think is missing from Google BigQuery is a tool for managing and orchestrating your ETL jobs, something like MS Data Factory but for Google Cloud.

Over the last year or so we have built up a few shell scripts that wrap the bq command line tool, take a series of templated SQL files, chain them together into a job and then schedule it all via cron. This has actually been working better than I would have expected, mainly because we try to build a lot of redundancy into these jobs so that failures won't kill any live tables and will usually be resolved the next time the job runs a few hours later.

However, it's always bugged me a little that this is not really a great way to go about things, and as the number of jobs and pipelines has grown it has become a little painful to manage.

Apache Airflow

Recently we have been playing around with Apache Airflow. This post is a concrete example of one way we have got it working for a specific use case that I did not really find any obvious existing examples of (there is actually great documentation and there are lots of examples, but there is a layer of Airflow-specific concepts and terminology one needs to nerd up on first).

So I won't talk much about Airflow in general, except to give some references I found very good for beginners:

The specific use case I was trying to figure out was creating a parameterised pipeline that takes the templated SQL files we already have and runs them for each of our lines of business or "lobs" (e.g. variety.com, hollywoodlife.com, wwd.com etc.). This is a pretty common ETL design pattern, so hopefully others might find this useful.

A concrete example at PMC would be some post-processing we do of the raw Google Analytics data we get exported into BigQuery each day. The pipeline basically takes the raw [<view_id>.ga_sessions_yyyymmdd] tables for each day, applies some transformations and enrichments, and then creates a table in BigQuery for each lob for each day. So, for example, [wwd.ga_data_20170322] would be the end result: a raw hit-level table that we have enriched in various ways. This is essentially one of our main analytical tables and powers lots of downstream analytics across all lobs.
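To make that a bit more concrete, here is a rough, hypothetical sketch of what one lob-and-day specific step of that pipeline could look like using the BigQueryOperator covered in the next section. The dag, dataset, table and SQL file names are made up for illustration, and it assumes destination_dataset_table is a templated field in your Airflow version so the {{ ds_nodash }} macro gets rendered per run.

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

# hypothetical sketch only - names are made up for illustration
ga_dag = DAG('ga_enrich_example', schedule_interval='@daily', start_date=datetime(2017, 3, 22))

lob = 'wwd'  # in practice this would come from a loop over all lobs, as in the example below
enrich_ga_data = BigQueryOperator(
    dag=ga_dag,
    task_id='enrich_ga_data_' + lob,
    bql='enrich_ga_data.sql',  # templated sql reading from the raw [<view_id>.ga_sessions_{{ ds_nodash }}] table
    params={"lob": lob},
    destination_dataset_table=lob + '.ga_data_{{ ds_nodash }}',  # e.g. [wwd.ga_data_20170322]
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='my_gcp_connection')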

Parameterized DAG Example

Below is an example of one way we got this working by having a single collection of SQL template files (to represent the actual steps of processing we want to do in BigQuery) and a .py script to build and define the expanded DAG with parallelism across all lobs while maintaining dependencies between each lob specific task.

Disclaimer: I'm pretty new to Airflow, and one thing I've found is that, due to the programmatic nature of how DAGs are built, there are lots of ways to do things (this is actually one of its strongest selling points). I'm sure there are better ways to go about what we are showing here (feel free to add suggestions in the comments 🙂).

So here is an example DAG definition Python script, which lives in its own subfolder in our Airflow DAGs folder (prettier formatting on GitHub here). I've tried to go overboard on the commenting for line-by-line clarity.

"""
### My first dag to play around with airflow and bigquery.
"""
# imports
from airflow import DAG
from datetime import datetime, timedelta
# we need to import the bigquery operator - there are lots of cool operators for different tasks and systems, you can also build your own
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
# create a dictionary of default typical args to pass to the dag
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,  # does this dag depend on the previous run of the dag? best practice is to try to have dags not depend on the state or results of a previous run
    'start_date': datetime(2017, 3, 22),  # from what date do you want to pretend this dag was born? by default airflow will try to backfill - be careful
    'email_on_failure': True,  # should we send emails on failure?
    'email': ['andrew.maguire@pmc.com'],  # who to email if it fails, i.e. me :)
    'retries': 1,  # if it fails, how many times should we retry?
    'retry_delay': timedelta(minutes=2),  # if we need to retry, how long should we wait before retrying?
}
# define the dag
dag = DAG('my_bigquery_dag',  # give the dag a name
          schedule_interval='@once',  # define how often you want it to run - you can pass cron expressions here
          default_args=default_args  # pass the default args defined above, or override them here if you want this dag to behave a little differently
          )
# define list of lobs we want to run for
lobs = ["lob001","lob002","lob003"]
# loop through the lobs we want to use to build up our dag
for lob in lobs:
    # define the first task, in our case a big query operator
    bq_task_1 = BigQueryOperator(
        dag=dag,  # need to tell airflow that this task belongs to the dag we defined above
        task_id='my_bq_task_1_' + lob,  # task ids must be unique within the dag
        bql='my_qry_1.sql',  # the actual sql command we want to run on bigquery is in this file in the same folder. it is also templated
        params={"lob": lob},  # the sql file above has a jinja template in it for a 'lob' parameter - this is how we pass it in
        destination_dataset_table='airflow.' + lob + '_test_task1',  # in this example we also want our target table to be lob and task specific
        write_disposition='WRITE_TRUNCATE',  # drop and recreate this table each time, you could use other options here
        bigquery_conn_id='my_gcp_connection'  # this is the airflow connection to gcp we defined in the front end. More info here: https://github.com/alexvanboxel/airflow-gcp-examples
    )
    # add documentation for what this task does - this will be displayed in the Airflow UI
    bq_task_1.doc_md = """\
    Append a "Hello World!" message string to the table [airflow.<lob>_test_task1]
    """
    # define the second task, in our case another big query operator
    bq_task_2 = BigQueryOperator(
        dag=dag,  # need to tell airflow that this task belongs to the dag we defined above
        task_id='my_bq_task_2_' + lob,  # task ids must be unique within the dag
        bql='my_qry_2.sql',  # the actual sql command we want to run on bigquery is in this file in the same folder. it is also templated
        params={"lob": lob},  # the sql file above has a jinja template in it for a 'lob' parameter - this is how we pass it in
        destination_dataset_table='airflow.' + lob + '_test_task2',  # in this example we also want our target table to be lob and task specific
        write_disposition='WRITE_TRUNCATE',  # drop and recreate this table each time, you could use other options here
        bigquery_conn_id='my_gcp_connection'  # this is the airflow connection to gcp we defined in the front end. More info here: https://github.com/alexvanboxel/airflow-gcp-examples
    )
    # add documentation for what this task does - this will be displayed in the Airflow UI
    bq_task_2.doc_md = """\
    Append a "Goodbye World!" message string to the table [airflow.<lob>_test_task2]
    """
    # set dependencies so, for example, 'bq_task_2' won't start until 'bq_task_1' has completed successfully
    bq_task_2.set_upstream(bq_task_1)

So the above file references our SQL template files, which hold the business logic of what we actually want to do in BigQuery; everything above is really just plumbing.

Below is what is in the 'my_qry_1.sql' file – it's just dummy code for this example. (Again, GitHub link here, and apologies that I don't know how to properly format these blocks as code.)

-- this is just some dummy bql to create a little example table
SELECT
  string('{{ params.lob }}') as lob, -- here the jinja template for the 'lob' parameter gets filled in
  string('{{ ds }}') as airflow_execution_date, -- here we leverage a predefined variable airflow provides, more info: http://airflow.readthedocs.io/en/latest/code.html?highlight=ds_nodash#default-variables
  string('{{ ds_nodash }}') as airflow_execution_date_yyyymmdd, -- another predefined airflow variable (same link as above)
  string('{{ ts }}') as airflow_execution_timestamp, -- another predefined airflow variable (same link as above)
  current_timestamp() as bq_timestamp, -- get the current time from bigquery so we can see any difference between the airflow execution date (in the case of backfills) and when we actually ran this code
  'Hello World!' as msg -- just a dummy field

With the above code in a folder within our specified Airflow DAGs folder, we can see how Airflow picks up this DAG.
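For reference, the folder layout assumed in this example looks something like the following (the .py file name is arbitrary; the important bit is that the SQL files sit in the same folder as the DAG definition so the bql file paths resolve):

<airflow_home>/dags/
    my_bigquery_dag/
        my_bigquery_dag.py    # the DAG definition script above
        my_qry_1.sql          # templated sql for task 1
        my_qry_2.sql          # templated sql for task 2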

Gratuitous Airflow UI screenshots coming right up…

In the graph view we can see the parallel nature of the same tasks, just broken out for each lob.

If we look at the tree view, we can see this in another way, along with execution status over time.

And finally, if we look at the Gantt view, we can see that we do indeed have the parallelism we were after, with task 1 being run concurrently for each lob and then similar concurrency for task 2.
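As an aside, if you ever wanted to cap how much of this runs at once, the DAG definition above could be extended with Airflow's concurrency-related arguments. A small sketch (the numbers are arbitrary examples, not what we actually use):

dag = DAG('my_bigquery_dag',
          schedule_interval='@once',
          default_args=default_args,
          concurrency=3,       # max task instances of this dag allowed to run at the same time
          max_active_runs=1    # max concurrent dag runs for this dag
          )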

So if we trigger this DAG from the Airflow CLI with:

$ airflow trigger_dag my_bigquery_dag
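(Related to the start_date comment in the DAG definition: if this dag had a daily schedule and you wanted to run it over a range of past dates rather than a single triggered run, the Airflow 1.x CLI also has a backfill command; the dates below are just placeholders.)

$ airflow backfill my_bigquery_dag -s 2017-03-22 -e 2017-03-25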

We can see the resulting data and tables in BigQuery.

And that's pretty much it. It's still early days, but I'm hoping the fact that we already have templated .sql files at the core of our existing approach should make porting over to Airflow easy enough (especially as most of the jobs run in BigQuery, so the LocalExecutor should be enough).
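For anyone else going the same route, switching from the default SequentialExecutor to the LocalExecutor is just a couple of settings in airflow.cfg. A minimal sketch for Airflow 1.x (the Postgres connection string is a made-up example; any SQLAlchemy-supported database other than SQLite should do):

[core]
# run tasks in parallel on the local machine
executor = LocalExecutor
# LocalExecutor needs a proper database backend rather than the default SQLite
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow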

We might do a follow-up post in a month or two to share some deeper learnings once we've used it more. I'm particularly excited about using Airflow pretty much anywhere and everywhere I can. One neat example that jumps to mind would be machine learning pipelines, where we tend to use BigQuery for the data crunching and H2O for the model building and training; Airflow seems like a great way to stitch it all together more cohesively.

3 Replies to “Playing around with Apache Airflow & BigQuery”

Seems like another approach could be to parameterize and define lob-specific dags inside the for loop. That would then create a dag just for each lob. This might have some advantages in terms of isolating lobs from each other, which might make some sense.

Good post. One potential issue with your follow-up idea is that Airflow will not detect dags that are not defined at the top level. This feature is useful for sub dags, but gets in the way of doing parameterized construction of dags. Another approach I can think of would be to use the DagRun operator to create a dag run per lob.
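For anyone trying the per-lob-dag idea above, the usual workaround for that detection issue is to register each generated dag in the module's global namespace inside the loop, since Airflow only picks up DAG objects it finds at the top level. A rough sketch, reusing default_args, lobs and the operator arguments from the example in the post:

for lob in lobs:
    dag_id = 'my_bigquery_dag_' + lob
    lob_dag = DAG(dag_id,
                  schedule_interval='@once',
                  default_args=default_args)
    bq_task_1 = BigQueryOperator(
        dag=lob_dag,
        task_id='my_bq_task_1_' + lob,
        bql='my_qry_1.sql',
        params={"lob": lob},
        destination_dataset_table='airflow.' + lob + '_test_task1',
        write_disposition='WRITE_TRUNCATE',
        bigquery_conn_id='my_gcp_connection')
    # assigning the dag into globals() is what lets airflow discover it at the top level of the module
    globals()[dag_id] = lob_dag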