Scaling Data Science: Slides from #DDTX17

Stefan Krawczyk

January 23, 2017
- San Francisco, CA

If you attended my talk at Data Day Texas in Austin last weekend, you heard me describe how Stitch Fix has reduced contention on:

Access to data

Access to ad-hoc compute resources

to help scale Data Science. As attendees requested, I have posted my slides here; you can find the link at the bottom.

For those who weren’t at my talk, here’s a brief background to the slides; they should be relatively self-explanatory after reading it.

Background

At Stitch Fix we have a lot of Data Scientists, around eighty at last count. One reason we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they have end-to-end responsibility for their work; they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor, and debug everything and anything required to get the desired output (see Engineers shouldn’t write ETL). They’re full data-stack Data Scientists!

Our Data Scientists are quite prolific at what they do – we’re approaching 4,500 job definitions at last count. So you might be wondering: how have we enabled them to get their jobs done without getting in each other’s way?

This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of the Data Scientist, the Data Platform team provides abstractions and infrastructure to help the Data Scientists. The relationship is a collaborative partnership, in which both are incentivized to help each other through feedback: Data Platform needs to understand the Data Scientists’ pain points, while Data Scientists won’t use a tool that doesn’t work for them. The end result is hopefully a well-designed tool that appeals to and is adopted by the Data Scientists.

To scale Data Science, the Data Platform team has helped establish patterns and infrastructure that alleviate contention on:

Access to Data

Access to Compute Resources:

Ad-hoc compute: prototype, iterate, workspace

Production compute: where things are executed once they’re needed regularly

For the talk (and this post) I focused only on how we reduced contention on Access to Data and Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that, I invite you to take a look through the slides.