An aspiring postdisciplinarian surfs through the ebbs and flows of the changing research environment


Let’s talk about data cleaning

Data cleaning has a bad rep. In fact, it has long been considered the grunt work of the data analysis enterprise. I recently came across a piece in the Harvard Business Review lamenting the amount of time data scientists spend cleaning their data. The author feared that data scientists’ skills were being wasted on the cleaning process when they could be spending their time on the analyses we so desperately need them to do.

I’ll admit that I haven’t always loved the process of cleaning data. But my view of the process has evolved significantly over the last few years.

As a survey researcher, my cleaning process used to begin with a tall stack of paper forms. Answers that did not make logical sense during the checking process sparked a trip to the file folders to find the form in question. The forms often held physical evidence of indecision on the part of the respondent, such as eraser marks or an explanation in the margin, which could not have been reflected properly by the data entry person. We lost this part of the process when we moved to web surveys. It sometimes felt like a web survey left the respondent no way to communicate with the researcher about their unique situation. Data cleaning lost its personalized feel and detective-story luster and became routine and tedious.

Despite the affordances of the move to web surveys, much of the cleaning process stayed rooted in the old techniques. Each form had its own id number, and the programmers would use those id numbers for corrections:

if id=1234567, set var1=5, set var7=62

At this point a “good programmer” would also document the changes for future collaborators:

*this person was not actually a forest ranger, and they were born in 1962
if id=1234567, set var1=5, set var7=62

Making these changes grew tedious very quickly, and the process seemed to drag on for ages. The researcher would check the data for potential errors, scour the records that could hold those errors for any kind of evidence of the respondent’s intentions, and then handle each form one at a time.
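In a modern scripting environment, that one-record-at-a-time pattern looks something like this (a sketch in pandas; the ids, variable names, and codes are all hypothetical):

```python
import pandas as pd

# Toy survey data; the ids, variable names, and codes are made up.
df = pd.DataFrame({
    "id":   [1234567, 2345678, 3456789],
    "var1": [2, 3, 1],     # e.g. an occupation code
    "var7": [35, 50, 41],  # e.g. a year-of-birth code
})

# One id-based correction, documented inline for future collaborators:
# this person was not actually a forest ranger, and they were born in 1962
df.loc[df["id"] == 1234567, ["var1", "var7"]] = [5, 62]
```

Every correction of this kind needs its own id lookup and its own comment, which is part of why the process drags.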

My techniques for cleaning data have changed dramatically since those days. My goal now is to use id numbers as rarely as possible and instead to ask myself questions like “how can I tell that these people are not forest rangers?” The answers to these questions evoke a subtly different technique:

* these people are not actually forest rangers
if var7=35 and var1=2 and var10 contains ‘fire fighter’, set var1=5

This technique requires honing and testing (adjusting the precision and recall), but I’ve found it to be far more efficient, more comprehensive and, most of all, more fun (oh hallelujah!). It makes me wonder whether we have perpetually undercut the quality of the data cleaning we do simply because we hold the process in such low esteem.
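A minimal sketch of the rule-based version, again in pandas with hypothetical variable names and codes:

```python
import pandas as pd

# Toy data; the variable names and codes are made up.
df = pd.DataFrame({
    "var1":  [2, 2, 5],     # occupation code (pretend 2 = forest ranger, 5 = fire fighter)
    "var7":  [35, 35, 40],  # another coded variable used as evidence
    "var10": ["fire fighter for the city", "park ranger", "fire fighter"],  # open-ended text
})

# Build a mask from the evidence pattern rather than from id numbers.
not_actually_rangers = (
    (df["var7"] == 35)
    & (df["var1"] == 2)
    & df["var10"].str.contains("fire fighter", case=False, na=False)
)

# Inspect df[not_actually_rangers] first to tune precision and recall,
# then apply the correction to every matching record at once.
df.loc[not_actually_rangers, "var1"] = 5
```

One rule now handles every record that fits the pattern, and tightening or loosening the mask is where the honing and testing happens.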

So far I have not discussed data cleaning for other types of data. I’m currently working on a corpus of Twitter data, and I don’t see much of a difference in the cleaning process. The data types and programming statements I use are different, but the process is very close. It’s an interesting and challenging process that involves detective work, a better and growing understanding of the intricacies of the dataset, a growing set of programming skills, and a growing understanding of the natural language use in your dataset. The process mirrors the analysis to such a degree that I’m not really sure why it would be such a bad thing for analysts to be involved in data cleaning.
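For instance, a first cleaning pass over tweets might look something like this (a rough sketch using Python’s re module; the exact steps any real study needs will differ):

```python
import re

def clean_tweet(text: str) -> str:
    """A few illustrative normalization steps for tweet text."""
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"[@#](\w+)", r"\1", text)  # strip @/# markers, keep the word
    text = re.sub(r"\s+", " ", text).strip()  # collapse stray whitespace
    return text
```

Deciding which of these steps to apply, and checking what each one destroys, is the same detective work as the survey case.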

I’d be interested to hear what my readers have to say about this. Is our notion of the value and challenge of data cleaning antiquated? Is data cleaning a burden that an analyst should bear? And why is there so little talk about data cleaning, when we could all stand to learn so much from each other in the way of data structuring code and more?


5 thoughts on “Let’s talk about data cleaning”

Great post, Casey! I think you make an excellent case for the growing importance of data cleaning in the analyst’s skill set. Data cleaners are in some respects similar to editors for good writers — anyone who has worked with a great editor can attest to the insights and clarity that s/he can bring to a project. Of course editors rarely get their due and I suspect it will be the same for data cleaning. Illustrating the evolving importance of data cleaning in data analysis provides the best rationale for its inclusion in the analyst’s responsibilities.
Thank you for posing these important questions!

Sometimes I imagine what it would have been like for highly analytical women who weren’t able to enter the job market. Among other things, I imagine the ways in which things in the kitchen could have been categorized and organized and the degree to which the work could go unnoticed. It’s easy to dismiss tasks as tedious that could actually have some real value if done in creatively efficient ways (That said, I’m not about to organize my kitchen!).

sidenote: My mom inadvertently lent some credence to this notion once by panicking about the lack of alphabetization in her spice cabinet. She always saw the kitchen as a sort of barrier to better things, but that apparently never kept her from leaving her analytical stamp on things.

I don’t think you intended to go “there”, but I will make the point that detail-oriented or organizational work is often performed by women and likely doesn’t get the respect it deserves. You make a great point about two approaches (tedious vs. creative) that I think is true not only for data management but also for other IT problem solving. I don’t know how many times I’ve seen the case in web development of “Plug code/widget/etc. here” rather than stepping back to really define the problem and find better ways of solving it.

I have actually been thinking about the power of metaphor in the realm of data cleaning, especially now that data cleaning has evolved into data wrangling. It’s far harder to call anything with such a sexy title tedious work or a waste of time, and yet the wrangling is still essentially about data formatting and the quirks of human behavior.

There is definitely quite a bit of shortcutting, and plenty of one-size-fits-all approaches to IT problems. I’m dealing with a lot of that in the Twitter study I’m doing now. I’m trying to reconcile the stats that I’ve seen with the data I have and the needs of my audience, and that reconciliation is such a challenge precisely because of the shortcutting that has become so institutionalized so quickly.