This one weird trick will simplify your ETL workflow

Russell Avdek

July 06, 2017 - San Francisco, CA

The first and most important step towards developing a powerful machine learning model is acquiring good data. It doesn’t matter if you’re using a simple logistic regression or the fanciest state-of-the-art neural network to make predictions: if you don’t have rich input, it’s garbage in, garbage out.

This exposes an unfortunate truth that every hopeful, young data scientist has to come to terms with: Most of the time spent developing a production-ready model is invested in writing ETL that gets your data in a clean format.

In this post aimed at SQL practitioners who would rather spend their time writing Python, I’ll outline a trick that can make your ETL more maintainable and easier to write using one of my favorite libraries, jinja2. R enthusiasts can also follow along, developing analogous patterns using one of the language’s many text templating packages, such as whisker.

Unlike the “one weird trick” from those ridiculous advertisements, ours actually works! Before we dig into the details, I’ll provide a little more motivation.

Writing SQL is a pain

As we’ve written about previously, Stitch Fix engineers do not write ETL for data scientists. Instead, they provide tools that empower data scientists to more easily perform computations and deploy applications themselves. In turn, data scientists are responsible for the end-to-end development of whatever data product they’re building, from data acquisition, to processing and modeling, to feeding their polished results into an API, web app, or analysis. That means most data scientists have to write a decent amount of ETL. Usually, this means writing SQL, and there are a number of fundamentally annoying issues with writing SQL.

Writing a complex query almost always involves writing the same chunks of code over and over again. This means that if something does change (say, the name of a column in a table your ETL reads from), you may have to make the same update in 10 places in a single query. This makes Python programmers feel slightly nauseous, as we’re violating the don’t repeat yourself (DRY) principle.

Spaghettification and hot potatoes

As the project which depends on your ETL becomes increasingly sophisticated and data resources change, SQL files tend to have more and more logic heaped onto them, turning your once elegant code into a swirling mess of ad hoc logic. It then becomes harder and harder for new collaborators to contribute to your project as its learning curve becomes impossibly steep, at which point your code becomes spaghetti and your project becomes a hot potato. Not as tasty as it sounds!

Thinking inside of the box

Think of your development framework as a box within which you’re working. When you begin working within the framework, there is lots of room to explore and grow. As you learn more and become more capable, you start to fill out the box, and incremental learnings tend to focus on minutiae. If your ability and thirst for learning start to exceed the flexibility of your framework, you’ll feel stifled, perceiving your growth as the walls closing in around you.

For this reason, end-users of a database who are less concerned with its internal workings (e.g., most data scientists) can become bored writing SQL and doing database work faster than they would working on, say, advanced modeling or software engineering work. If you’re not feeling inspired by what you’re working on, it’s easy to start ignoring best practices, accelerating the advancement of spaghettification.

Getting Pythonic with Jinja2

If you’ve ever written a web application with Flask or Django, you might know a couple things about writing HTML:

It’s a pain for a lot of the same reasons that writing SQL is: It’s usually verbose, redundant, and difficult to modularize.

There’s an out-of-the-box solution to make it less painful and more Pythonic!

This out-of-the-box solution is the text templating mini-language jinja2. You can think of it as Python’s built-in str.format on steroids: you can unpack variables with {{ ... }} and write loops and other control structures with {% ... %}, all driven by Python data structures. The syntax is intuitive and has friendly documentation with a wealth of additional capabilities.
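To make this concrete, here is a minimal sketch of rendering a SQL snippet from a Python list of column names. The table and column names are illustrative assumptions, not from a real schema:

```python
from jinja2 import Template

# Hypothetical column list -- the template expands it into SQL.
COLUMNS = ["height", "weight", "is_male"]

template = Template("""
SELECT
    customer_id
    {% for col in columns %}
    , {{ col }}
    {% endfor %}
FROM customer_attr
""")

sql = template.render(columns=COLUMNS)
print(sql)
```

Changing the `COLUMNS` list in one place updates every generated reference, which is exactly the DRY property plain SQL lacks.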

Our “one weird trick” is that we can apply this tool to writing SQL, helping us to stay DRY and improve maintainability.

Example: Implementing a linear model

We’ll create a toy example to show how we can “productionize” a linear regression model typically encountered in statistics textbooks – modeling an individual’s income based on height and other biometric features. While this use-case is unrealistically simple, it will illustrate how jinja2 can make a modeling codebase which requires ETL easy both to rapidly prototype and iterate on.

A functioning version of the finished code can be found in this GitHub repo. For additional examples, check out the Appendix section at the end of this post.

Let’s assume the following setup, which has fake data and a few realistic data-environment issues to illustrate the advantages of applying these tools in an enterprise use-case.

Setup

Our input data:

We have a table called customer_attr of attributes on some customers, including income, height, weight, and is_male, along with a customer id.

We’ll assume all inputs are integers and that each column has some missing values.

Our goal:

We’ll assume income has the most missing values so our goal will be to fill in the gaps by predicting from the other attributes.

We’ll want to write a script which will train the model on available data and write out predictions to a table called predicted_income.

Our environment:

We’ll pretend that customer_attr is too big to fit into memory on the machine which will run our code, but we want to use scikit-learn to create our model. We’ll use a 10% sample from the data where income is not null for training.

For simplicity, we’ll assume we’re provided 2 functions:

query_to_pandas(query) which executes a SQL query against a database and returns the output as a DataFrame.

query_to_table(query, table) which executes a SQL query and writes the output to table in our database.
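The original code listing from the repo isn’t reproduced here, but the approach can be sketched as follows. A single feature list drives a templated training query; the list, template, and sampling clause are illustrative assumptions (RANDOM() is Postgres-style syntax), and query_to_pandas is the helper described above:

```python
from jinja2 import Template

# Hypothetical feature list: adding one name here expands into several
# generated lines of SQL below (selection plus NULL filtering).
FEATURES = ["height", "weight", "is_male"]

TRAINING_QUERY = Template("""
SELECT
    income
    {% for f in features %}
    , {{ f }}
    {% endfor %}
FROM customer_attr
WHERE income IS NOT NULL
    {% for f in features %}
    AND {{ f }} IS NOT NULL
    {% endfor %}
    AND RANDOM() < 0.1  -- 10% training sample
""")

sql = TRAINING_QUERY.render(features=FEATURES)
print(sql)
# train_df = query_to_pandas(sql)  # then fit a scikit-learn model on train_df
```

Each feature name appears exactly once in the codebase; the template fans it out everywhere the query needs it.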

Benefits of this approach

Marvel at all of the additional query logic we got by adding a single line to our codebase! Our single new line of JSON produced 9 more lines of SQL. If we were to add a dozen features, our use of jinja2 would save us over 100 lines of SQL code. Because each variable is referenced only once in our codebase, making updates to input data is easy. Clearly this trick is useful for staying DRY and avoiding spaghettification.

As a bonus, we can perform our model evaluation purely in SQL! This prevents us from having to perform costly calculations in-memory (like pulling the entire customer_attr table into memory, which we assumed was impossible). While in practice our linear model may be overly simple for the needs of most predictive tasks, it allows us to get a prototype with minimal software dependencies ready in an extremely short amount of time, so it can start producing business value.
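One way to push scoring into the database, sketched below under assumptions: the fitted intercept and coefficient values are made-up placeholders (in practice they would come from the scikit-learn model), and query_to_table is the helper described in the setup:

```python
from jinja2 import Template

# Render fitted model parameters directly into a scoring query so that
# prediction happens in the database, not in memory.
SCORING_QUERY = Template("""
SELECT
    customer_id
    , {{ intercept }}
    {% for f, c in coefs.items() %}
    + {{ c }} * {{ f }}
    {% endfor %}
    AS predicted_income
FROM customer_attr
WHERE income IS NULL
""")

sql = SCORING_QUERY.render(
    intercept=1000.0,  # hypothetical fitted intercept
    coefs={"height": 120.5, "weight": -3.2, "is_male": 850.0},
)
print(sql)
# query_to_table(sql, "predicted_income")
```

Because a linear model is just a weighted sum, it translates into portable SQL arithmetic with no in-database ML support required.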

Wrapping up

In this post we demonstrated how a web development tool, jinja2, empowers us with a framework to put together creative solutions to database work. This “one weird trick”, usually outside of the standard ETL toolkit, allows us to keep our code DRY and make ETL development feel more like standard Python development.

Even if you choose not to incorporate it into your workflow, there is a more important takeaway I’d like to convey: If you find a problem is redundant or just plain uninteresting, it’s far better to innovate or automate your way to a solution than slump into dispassion.

By investing time in learning things outside of your comfort zone, you may find fresh approaches to old problems. For me, working on a few web apps ended up being more valuable than I’d anticipated.

Appendix: Other novel use-cases

In our example above, we used jinja2 to manage features for a simple regression model. Here are a few other useful applications:

One-hot encoding with LIKE statements

For predictive models which learn from text, we can use text templating to engineer one-hot encoded dummy variables directly from a database.
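A sketch of the idea, under assumptions: the keyword list, table, and column names are all hypothetical, and each keyword becomes a LIKE-based dummy column:

```python
from jinja2 import Template

# Hypothetical keyword list: each entry becomes a one-hot dummy column
# computed via a substring match in the database.
KEYWORDS = ["dress", "jeans", "blouse"]

DUMMY_QUERY = Template("""
SELECT
    item_id
    {% for kw in keywords %}
    , CASE WHEN description LIKE '%{{ kw }}%' THEN 1 ELSE 0 END
        AS has_{{ kw }}
    {% endfor %}
FROM item_descriptions
""")

sql = DUMMY_QUERY.render(keywords=KEYWORDS)
print(sql)
```

Growing the feature set is then just appending a string to `KEYWORDS` rather than hand-writing another CASE expression.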

Pivot tables

While Spark does have a pivot operation, performing pivots or transposes in most flavors of SQL is simply not possible natively. Using jinja2, it becomes extremely simple to perform these operations in any version of SQL. Using these operations in WITH clauses and sub-queries can greatly reduce the complexity and execution time of large SQL scripts.

As an example, let’s say we have a table of customer purchase records with columns customer_id, price_paid, and item_type, where item_type has a small number of values. We can cleanly present the data on how much each customer spent on each item type as a pivot table:
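A sketch of that pivot, assuming a hypothetical table name (`purchases`) and an illustrative set of item types, templated as a sum of CASE expressions:

```python
from jinja2 import Template

# Hypothetical item types: each becomes a pivoted spend column.
ITEM_TYPES = ["tops", "bottoms", "shoes"]

PIVOT_QUERY = Template("""
SELECT
    customer_id
    {% for t in item_types %}
    , SUM(CASE WHEN item_type = '{{ t }}' THEN price_paid ELSE 0 END)
        AS spend_{{ t }}
    {% endfor %}
FROM purchases
GROUP BY customer_id
""")

sql = PIVOT_QUERY.render(item_types=ITEM_TYPES)
print(sql)
```

Because only standard SUM and CASE are used, the same rendered query runs on virtually any SQL dialect.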