Lessons from next-generation data wrangling tools

One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in an earlier post, our interest stems from both consumer apps as well as tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is in data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can increase productivity in this critical area.

Scalability ~ data variety and size

Not only are enterprises faced with many data stores and spreadsheets, data scientists have many more (public and internal) data sources they want to incorporate. The absence of a global data model means integrating data silos, and data sources requires tools for consolidating schemas.

Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data wrangling “scripts” that can be used on full data sets.

Empower domain experts

In many instances, you need subject area experts to explain specific data sets that you’re not familiar with. These experts can place data in context and are usually critical in helping you clean and consolidate variables. Trifacta has tools that enable non-programmers to take on data wrangling tasks that used to require a fair amount of scripting.

Consider DSLs and visual interfaces

“Programs written in a [domain specific language] (DSL) also have one other important characteristic: they can often be written by non-programmers…a user immersed in a domain already knows the domain semantics. All the DSL designer needs to do is provide a notation to express that semantics.”

I’ve often used regular expressions for data wrangling, only to come back later unable to read the code I wrote (Joe Hellerstein describes regex as “meant for writing & never reading again”). Programs written in DSLs are concise, easier to maintain, and can often be written by non-programmers.

Trifacta designed a “readable” DSL for data wrangling but goes one step further: their users “live in visualizations, not code.” Their elegant visual interface is designed to accomplish most data wrangling tasks, but it also lets users access and modify accompanying scripts written in their DSL (power users can also use regular expressions).

These ideas go beyond data wrangling. Combining DSLs with visual interfaces can open up other aspects of data analysis to non-programmers.

Many data analysis tasks involve a handful of data sources that require painstaking data wrangling along the way. Scripts to automate data preparation are needed for replication and maintenance. Trifacta looks at user behavior and context to produce “utterances” of its DSL, which users can then edit or modify.

Don’t forget about replication

If you believe the adage that data wrangling consumes a lot of time and resources, then it goes without saying that tools like Tamr and Trifacta should produce reusable scripts and track lineage. Other aspects of data science — for example, model building, deployment, and maintenance — need tools with similar capabilities.

Ben Lorica is the Chief Data Scientist at O'Reilly Media, Inc. and is the Program Director of both the Strata Data Conference and the O'Reilly Artificial Intelligence Conference. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.