Rose D'souza, Columbia University Graduate School of Journalism, Class of 2012

So You Want To Be a Data Journalist: Clean Data

The best data visualization should be pretty, fun to play with, efficient and, most importantly, contain accurate information. Clean data is fundamental. All the visual stuff comes after it.

Edward R. Tufte explains how data can be analyzed, and the common mistakes people make along the way, in a chapter of “Data Analysis for Politics and Policy.” For the record, even though he writes about data and policy, it actually makes sense and isn’t tied up with too much jargon.

Tufte simplifies his explanations with an example: comparing death rates from car accidents across states to see whether state-required car inspections make a difference in the number of deaths.

You would think that analyzing this data would be uncomplicated, but Tufte makes an important point by warning that there are a number of other factors we should consider as “independent variables,” such as weather conditions, whether the number spiked because of an abnormal event like a big bus crash that killed many people at once, state population (density), the quality of car inspections, how car accident deaths are recorded, and the percentage of young drivers.

Once these other variables are taken into account, it becomes clear that simply averaging the number of deaths per state would produce misleading results, because unusually large or small numbers (the outliers) could skew the average. For example, Michigan authorities record car accidents differently than Arizona authorities do. (I can’t pretend to know technical car terms; I don’t even have a license.)
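
To see how a single outlier can skew an average, here is a minimal Python sketch. The state names and death rates are made up for illustration and are not Tufte's data; the point is only to compare the mean with the median, which is one common way (not necessarily Tufte's) to check whether an outlier is distorting the picture.

```python
from statistics import mean, median

# Hypothetical yearly car-accident deaths per 100,000 residents.
# These numbers are invented for illustration only.
death_rates = {
    "State A": 10.0,
    "State B": 12.0,
    "State C": 11.0,
    "State D": 13.0,
    "State E": 9.0,
    "State F": 41.0,  # outlier, e.g. a year with one catastrophic bus crash
}

rates = list(death_rates.values())

# The mean is pulled upward by the single extreme value.
average_rate = mean(rates)

# The median (the middle value) barely notices the outlier.
median_rate = median(rates)

# The mean again, this time leaving the outlier state out.
rates_without_outlier = [
    rate for state, rate in death_rates.items() if state != "State F"
]
average_without_outlier = mean(rates_without_outlier)

print(f"Mean of all states:       {average_rate:.1f}")
print(f"Median of all states:     {median_rate:.1f}")
print(f"Mean without the outlier: {average_without_outlier:.1f}")
```

With these invented numbers, the mean lands well above what five of the six states actually report, while the median stays close to them. A large gap between the two is a hint to go back and ask why one state looks so different before drawing any conclusions.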

Anyway, I realized that gathering and analyzing a clean data set means considering many variables, not just the obvious ones. And, as Tufte says, even if some important factors can’t be measured, “whatever unknowns remain, the analysis of quantitative data nonetheless can help us learn something about the world – even if it is not the whole story.” In other words, I think Tufte suggests that data journalists should strive for accuracy in collecting and comparing data: know which comparisons work and which don’t make sense, and acknowledge that the data may be incomplete because of missing variables. Even so, it should at least be clean.