Describing the data pipeline: A vocabulary for city data analysis

Recently, I was lucky enough to attend the What Works Cities Summit, a gathering of city leaders dedicated to using data to inform their policy-making. As I walked through the conference, I tried to identify whether each new city official I met was a strategist or a practitioner. Strategists include mayors, city planners, and police chiefs: officials seeking to use data-driven metrics in decision-making. Practitioners focus on gathering data from noisy and imperfect real-world sources and turning it into usable products, like dashboards and reports, which decision-makers use to inform their work.

The city data practitioners I met were mostly self-taught. The typical career path involved starting as a database manager, munging data in the basement in the late ’80s or early ’90s. As technology and data became more important in city government, these individuals were elevated and tasked with generating the reports on which the strategists depended.

I recognized many of the problems practitioners were tackling. There were questions about data authority, accuracy, objectivity, and coverage, all questions I encounter in my work at Google. It was clear that practitioners had developed many inventive solutions to these challenges, but the lack of a shared vocabulary hampered their ability to discuss best practices. I saw many conversations during the conference in which it took skilled practitioners ten or fifteen minutes of discussion to even reach the point where they could meaningfully share their experiences. I started as a self-taught data munger, and I identify with the practitioners I met. However, at Google I now work in a community of data analysts and data scientists who have developed a formal framework and vocabulary for describing the problems involved in this work.

The basic framework I use to describe the data pipeline consists of four steps:

1. Ingestion refers to the process of finding data and importing it into a database, even if that’s just a spreadsheet. Ingestion can involve reformatting existing files or it can mean, in the worst-case scenario, manually transcribing data from paper to a digital format.

2. Munging and wrangling refers to the often arduous task of getting data ready for analysis. Kind of a catch-all bucket of work, data munging involves untangling all the tricky knots that inevitably form when data is not well cared for. Munging can involve parsing fields that need to be separated, correcting spelling mistakes, dealing with missing data, normalizing data, and ensuring consistency of format throughout a dataset.

3. Computation, analysis, and modeling refers to the work involved in taking cleaned-up data and generating the metrics upon which decision-makers rely. In the most sophisticated data science shops, this would include building predictive models that correlate data with outcomes. It can also be as simple as writing a basic formula in a spreadsheet.

4. Reporting, the final step in the pipeline, is an often overlooked part of data analysis. Helping people understand the lessons they can derive from analyzed data, as well as the strength or weakness of the data supporting each metric, makes decisions based on that data more likely to improve performance.
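
To make the four steps above a little more concrete, here is a minimal Python sketch that walks a tiny dataset through each one. Everything in it is an assumption made for illustration: the service-request data is invented, and the field names (`location`, `days_to_close`) and function names (`ingest`, `munge`, `compute`, `report`) are hypothetical rather than taken from any real city system. It uses only the standard library.

```python
import csv
import io
from statistics import mean

# Hypothetical raw export (all values invented): inconsistent casing, stray
# whitespace, a combined "ward/neighborhood" field, and one missing value.
RAW_CSV = """request_id,location,days_to_close
1, 3/Riverside ,4
2,3/riverside,6
3,5/Hilltop,
4,5/HILLTOP,10
"""


def ingest(text):
    """Step 1, ingestion: read the raw file into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def munge(rows):
    """Step 2, munging: split the combined field, normalize the format,
    and mark missing values explicitly as None."""
    cleaned = []
    for row in rows:
        ward, neighborhood = row["location"].strip().split("/", 1)
        days = row["days_to_close"].strip()
        cleaned.append({
            "ward": int(ward),
            "neighborhood": neighborhood.strip().title(),
            "days_to_close": float(days) if days else None,
        })
    return cleaned


def compute(rows):
    """Step 3, computation: mean days to close per neighborhood, keeping
    track of how many rows actually contributed to each figure."""
    groups = {}
    for row in rows:
        groups.setdefault(row["neighborhood"], []).append(row["days_to_close"])
    metrics = {}
    for name, values in groups.items():
        known = [v for v in values if v is not None]
        metrics[name] = {
            "mean_days": mean(known) if known else None,
            "n_used": len(known),
            "n_total": len(values),
        }
    return metrics


def report(metrics):
    """Step 4, reporting: state each metric along with how much data
    supports it."""
    lines = []
    for name in sorted(metrics):
        m = metrics[name]
        if m["mean_days"] is None:
            figure = "no usable data"
        else:
            figure = f"{m['mean_days']:.1f} days on average"
        lines.append(
            f"{name}: {figure} "
            f"({m['n_used']} of {m['n_total']} requests had a close date)"
        )
    return lines


if __name__ == "__main__":
    for line in report(compute(munge(ingest(RAW_CSV)))):
        print(line)
```

The reporting step deliberately prints the coverage next to each average, so a reader can see that the Hilltop figure rests on a single request while the Riverside figure uses both of its rows.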

Each step of this process is associated with common challenges, sources of error, and approaches to efficient handling. Adopting this framework and vocabulary would help city analysts share lessons learned, identify best practices, and form a stronger community.