Tag Archives: hci

I often hear that many of the leading data analysts in the field have PhDs in physics or biology or the like, rather than computer science. Computer scientists are typically interested in methods; physical scientists are interested in data.

Another thing I often hear is that a large fraction of the time spent by analysts — some say the majority of time — involves data preparation and cleaning: transforming formats, rearranging nesting structures, removing outliers, and so on. (If you think this is easy, you’ve never had a stack of ad hoc Excel spreadsheets to load into a stat package or database!)
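To make this concrete, here is a minimal sketch (not from the original post) of the kind of ad hoc cleaning script analysts end up writing by hand: the data, field names, and cleaning rules are all hypothetical, and the outlier rule shown (median absolute deviation) is just one common choice.

```python
import statistics

# Hypothetical messy input: inconsistent date formats, a non-numeric
# cell, and an obvious data-entry outlier.
raw_rows = [
    {"date": "2011-02-01", "value": "12.5"},
    {"date": "02/03/2011", "value": "13.1"},
    {"date": "2011-02-05", "value": "n/a"},   # missing value
    {"date": "2011-02-07", "value": "9999"},  # data-entry outlier
    {"date": "02/09/2011", "value": "11.8"},
]

def normalize_date(s):
    """Rewrite MM/DD/YYYY dates into ISO YYYY-MM-DD."""
    if "/" in s:
        mm, dd, yyyy = s.split("/")
        return f"{yyyy}-{mm}-{dd}"
    return s

# Step 1: unify formats and parse numbers, dropping unparseable cells.
cleaned = []
for row in raw_rows:
    try:
        value = float(row["value"])
    except ValueError:
        continue  # skip rows whose value field is not numeric
    cleaned.append({"date": normalize_date(row["date"]), "value": value})

# Step 2: drop outliers far from the median, using the median absolute
# deviation (more robust than mean/stdev on tiny, dirty samples).
med = statistics.median(r["value"] for r in cleaned)
mad = statistics.median(abs(r["value"] - med) for r in cleaned)
cleaned = [r for r in cleaned if abs(r["value"] - med) <= 5 * mad]
```

Every step here is trivial in isolation, but scripts like this accrete one special case at a time, which is exactly the drudgery at issue.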

Put these together and something is very wrong: high-powered people are wasting most of their time on low-function work. And the challenge of improving this state of affairs has fallen into the cracks between the analysts and the computer scientists.

DataWrangler is a new tool we’re developing to address this problem; I demo’d it today at the O’Reilly Strata Conference. DataWrangler is an intelligent visual data transformation tool that lets users reshape, transform and clean data in an intuitive way — one that surprises most people who’ve worked with data. As you manipulate data in a grid layout, the tool automatically infers information both about the data and about your intentions for transforming it. It’s hard to describe in words, but the lead researcher on the project — Stanford PhD student Sean Kandel — has a quick video up on the DataWrangler homepage that shows how it works. Sean has put DataWrangler live on the site as well.

Tackling these problems fundamentally requires a hybrid technical strategy. Under the covers, DataWrangler is a heady mix of second-order logic, machine learning methods, and human-computer interaction design. We wrote a research paper about it that will appear in this year’s SIGCHI conference.

If you’re interested in this space, also have a look at Shankar Raman’s very prescient Potter’s Wheel work from a decade ago, the PADS project at AT&T and Princeton, recent research from Sumit Gulwani at Microsoft Research, and David Huynh’s most excellent Google Refine. All good stuff!

[Update 1/15/2010: this paper was awarded Best Student Paper at ICDE 2010! Congrats to Kuang, Harr and Neil on the well-deserved recognition!]

[Update 11/5/2009: the first paper on Usher will appear in the ICDE 2010 conference.]

Data quality is a big, ungainly problem that gets too little attention in computing research and the technology press. Databases pick up “bad data” — errors, omissions, inconsistencies of various kinds — all through their lifecycle, from initial data entry/acquisition through data transformation and summarization, and through integration of multiple sources.

While writing a survey for the UN on the topic of quantitative data cleaning, I got interested in the dirty roots of the problem: data entry. This led to our recent work on Usher [11/5/2009: Link updated to final version], a toolkit for intelligent data entry forms, led by Kuang Chen.