The Data Science Process, Rediscovered

The Data Science Process is a relatively new framework for doing data science. This article compares it to earlier, similar frameworks and then discusses process innovation versus repetition.

The Data Science Process

The Data Science Process is a framework for approaching data science tasks, crafted by Joe Blitzstein and Hanspeter Pfister of Harvard's CS 109. The goal of CS 109, as per Blitzstein himself, is to introduce students to the overall process of data science investigation, which should provide some insight into the framework itself.

The following is a sample application of Blitzstein & Pfister's framework, listing the skills and tools for each stage, as given by Ryan Fox Squire in his answer:

Stage 1: Ask A Question

Skills: science, domain expertise, curiosity

Tools: your brain, talking to experts, experience

Stage 2: Get the Data

Skills: web scraping, data cleaning, querying databases, CS stuff

Tools: python, pandas

Stage 3: Explore the Data

Skills: Get to know data, develop hypotheses, patterns? anomalies?

Tools: matplotlib, numpy, scipy, pandas, mrjob

Stage 4: Model the Data

Skills: regression, machine learning, validation, big data

Tools: scikit-learn, pandas, mrjob, mapreduce

Stage 5: Communicate the Data

Skills: presentation, speaking, visuals, writing

Tools: matplotlib, adobe illustrator, powerpoint/keynote

Squire then (rightfully) concludes that the data science workflow is a non-linear, iterative process, and that many skills and tools are required to cover the full data science process. Squire also says he is fond of the Data Science Process because it stresses both the importance of asking questions to guide your workflow and the importance of iterating on your questions and research as you gain familiarity with your data.
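To make the five stages concrete, here is a minimal sketch of one pass through them in Python. It is an illustration only, not taken from CS 109 or from Squire's answer: the question, the inline dataset, and every function name are hypothetical, and it uses only the standard library where a real project would more likely reach for the tools listed above.

```python
import csv
import io
import statistics

# Stage 1: Ask a question (hypothetical example question).
QUESTION = "Do hours studied predict exam score?"

# Stage 2: Get the data. A made-up inline CSV stands in for
# web scraping or a database query.
RAW_CSV = """hours,score
1,52
2,55
3,61
4,64
5,70
"""


def get_data(text):
    """Parse the CSV text into a list of (hours, score) float pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["hours"]), float(row["score"])) for row in reader]


# Stage 3: Explore the data with a few summary statistics.
def explore(rows):
    hours = [h for h, _ in rows]
    scores = [s for _, s in rows]
    return {
        "n": len(rows),
        "mean_hours": statistics.mean(hours),
        "mean_score": statistics.mean(scores),
    }


# Stage 4: Model the data with an ordinary least-squares line.
def fit_line(rows):
    hours = [h for h, _ in rows]
    scores = [s for _, s in rows]
    mean_h = statistics.mean(hours)
    mean_s = statistics.mean(scores)
    sxy = sum((h - mean_h) * (s - mean_s) for h, s in rows)
    sxx = sum((h - mean_h) ** 2 for h in hours)
    slope = sxy / sxx
    intercept = mean_s - slope * mean_h
    return slope, intercept


# Stage 5: Communicate the result as a short plain-language summary.
def communicate(question, summary, slope, intercept):
    return (
        f"Question: {question}\n"
        f"Observations: {summary['n']}\n"
        f"Fitted line: score = {intercept:.1f} + {slope:.1f} * hours"
    )


if __name__ == "__main__":
    rows = get_data(RAW_CSV)
    summary = explore(rows)
    slope, intercept = fit_line(rows)
    print(communicate(QUESTION, summary, slope, intercept))
```

In practice the stages would not run once from top to bottom. What turns up during exploration or modeling would typically send you back to refine the question or gather more data, which is the iteration Squire describes.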